Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:
- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
format used by hardware. This also fixes mthca's static rate
handling for HCAs that are capable of 4X DDR.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Consolidate IPoIB's private neighbour data handling into
ipoib_neigh_alloc() and ipoib_neigh_free(). This will make it easier
to keep track of the neighbour structures that IPoIB is handling, and
is a nice cleanup of the code:
add/remove: 2/1 grow/shrink: 1/8 up/down: 100/-178 (-78)
function old new delta
ipoib_neigh_alloc - 61 +61
ipoib_neigh_free - 36 +36
ipoib_mcast_join_finish 1288 1291 +3
path_rec_completion 575 573 -2
ipoib_mcast_join_task 664 660 -4
ipoib_neigh_destructor 101 92 -9
ipoib_neigh_setup_dev 14 3 -11
ipoib_neigh_setup 17 - -17
path_free 238 215 -23
ipoib_mcast_free 329 306 -23
ipoib_mcast_send 718 684 -34
neigh_add_path 705 650 -55
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the mthca debugging trace output code so that it can enabled
and disabled at runtime with the debug_level module parameter in
sysfs. Also, don't allow CONFIG_INFINIBAND_MTHCA_DEBUG to be disabled
unless CONFIG_EMBEDDED is selected. We want users (and especially
distros) to have this turned on unless they really need to save space,
because by the time we want debugging output, it's usually too late to
rebuild a kernel.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't allow CONFIG_INFINIBAND_IPOIB_DEBUG to be disabled unless
CONFIG_EMBEDDED is selected. We want users (and especially distros)
to have this turned on unless they really need to save space, because
by the time we want debugging output, it's usually too late to rebuild
a kernel. The debugging output can be controlled at runtime via the
debug_level module parameter in sysfs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We have seen the following OOPs in cancel_mads, when restarting opensm
multiple times:
Call Trace:
[<c010549b>] show_stack+0x9b/0xb0
[<c01055ec>] show_registers+0x11c/0x190
[<c01057cd>] die+0xed/0x160
[<c031b966>] do_page_fault+0x3f6/0x5d0
[<c010511f>] error_code+0x4f/0x60
[<f8ac4e38>] cancel_mads+0x128/0x150 [ib_mad]
[<f8ac2811>] unregister_mad_agent+0x11/0x130 [ib_mad]
[<f8ac2a12>] ib_unregister_mad_agent+0x12/0x20 [ib_mad]
[<f8b10f23>] ib_umad_close+0xf3/0x130 [ib_umad]
[<c0162937>] __fput+0x187/0x1c0
[<c01627a9>] fput+0x19/0x20
[<c0160f7a>] filp_close+0x3a/0x60
[<c0121ca8>] put_files_struct+0x68/0xa0
[<c0103cf7>] do_signal+0x47/0x100
[<c0103ded>] do_notify_resume+0x3d/0x40
[<c0103f9e>] work_notifysig+0x13/0x25
We traced this back to local_completions unlocking mad_agent_priv->lock
while still keeping a pointer into local_list. A later call to
list_del(&local->completion_list) would then corrupt the list.
To fix this, remove the entry from local_list after looking it up but
before releasing mad_agent_priv->lock, to prevent cancel_mads from
finding and freeing it.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Integrate the ipath core and OpenIB drivers into the kernel build
infrastructure. Add entry to MAINTAINERS.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipath_verbs.c file implements the driver-specific components of the
kernel's Infiniband verbs layer.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Completion queues, local and remote memory keys, and memory region
support.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is an implementation of the Infiniband RC ("reliable connection")
protocol.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
These files implement the Infiniband UC ("unreliable connection") and UD
("unreliable datagram") protocols.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
These header files are used by the layered Infiniband driver.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The layering interfaces are used to implement the Infiniband protocols
and the ethernet emulation driver.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
These files introduce a char device that userspace apps use to gain
direct memory-mapped access to the InfiniPath hardware, and routines for
pinning and unpinning user memory in cases where the hardware needs to
DMA into the user address space.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipathfs filesystem contains files that are not appropriate for
sysfs, because they contain binary data. The hierarchy is simple; the
top-level directory contains driver-wide attribute files, while numbered
subdirectories contain per-device attribute files.
Our userspace code currently expects this filesystem to be mounted on
/ipathfs, but a final location has not yet been chosen.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
EEPROM support, interrupt handling, statistics gathering, and write
combining management for x86_64.
A note regarding i2c: The Atmel EEPROM hardware we use looks like an
i2c device electrically, but is not i2c compliant at all from a
functional perspective. We tried using the kernel's i2c support to
talk to it, but failed.
Normal i2c devices have a single 7-bit or 10-bit i2c address that they
respond to. Valid 7-bit addresses range from 0x03 to 0x77. Addresses
0x00 to 0x02 and 0x78 to 0x7F are special reserved addresses
(e.g. 0x00 is the "general call" address.) The Atmel device, on the
other hand, responds to ALL addresses. It's designed to be the only
device on a given i2c bus. A given i2c device address corresponds to
the memory address within the i2c device itself.
At least one reason why the linux core i2c stuff won't work for this
is that it prohibits access to reserved addresses like 0x00, which are
really valid addresses on the Atmel devices.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipath_init_chip.c sets up an InfiniPath device for use.
ipath_diag.c permits userspace diagnostic tools to read and write a
chip's registers. It is different in purpose from the mmap interfaces
to the /sys/bus/pci resource files.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This file contains routines and definitions specific to InfiniPath
devices that have PCI Express interfaces.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipath_ht400.c file contains routines and definitions specific to
HyperTransport-based InfiniPath devices.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipath_common.h and ips_common.h contain definitions shared between
userspace and the kernel.
ipath_kernel.h is the core driver header file.
ipath_debug.h contains mask values used for controlling driver debugging.
ipath_registers.h contains bitmask definitions used in chip registers.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipath driver is a low-level driver for PathScale InfiniPath host
channel adapters (HCAs) based on the HT-400 and PE-800 chips, including
the InfiniPath HT-460, the small form factor InfiniPath HT-460, the
InfiniPath HT-470 and the Linux Networx LS/X.
The ipath_driver.c file contains much of the low-level device handling
code.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add RMPP support for additional management classes that support it.
Also, validate RMPP is consistent with management class specified.
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Received responses are currently matched against sent requests based
on TID only. According to the spec, responses should match based on
the combination of TID, management class, and requester LID/GID.
Without the additional qualification, an agent that is responding to
two requests, both of which have the same TID, can match RMPP ACKs
with the incorrect transaction. This problem can occur on the SM node
when responding to SA queries.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Quite a few cleanup functions in mthca were marked as __devexit.
However, they could also be called from error paths during
initialization, so they cannot be marked that way. Just delete all of
the incorrect annotations.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_hard_header() needs to handle the case that daddr is NULL. This
can happen when packets are injected via a raw socket, and IPoIB
shouldn't oops in this case.
Reported by Anton Blanchard <anton@samba.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The previous patch for Tavor broke MemFree logic.
The driver should perform limit check only for Tavor. For MemFree,
the check is incorrect, since ds (WQE stride) is always a power-of-2
(although the max_desc_size may not be).
In Tavor, however, WQE stride and desc_size are the same, and are not
necessarily power-of-2. The check was really for the WQE stride (and
it Tavor, we use max_desc_size for the stride).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The recently merged patch to create a fake scatterlist for non-SG SCSI
commands had a bug: the driver ended up doing dma_unmap_sg() on a
scatterlist scmnd->request_buffer rather than the fake scatter list it
created. Fix this so that the driver unmaps the same thing it maps.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch causes the network interface to respond to P_Key change
events correctly. As a result, you'll see a child interface in the
"RUNNING" state (netif_carrier_on()) only when the corresponding P_Key
is configured by the SM. When SM removes a P_Key, the "RUNNING" state
will be disabled for the corresponding network interface. To
implement this, I added IB_EVENT_PKEY_CHANGE event handling. To
prevent flushing the device before the device is open by the "delay
open" mechanism, I added an additional device flag called
IPOIB_FLAG_INITIALIZED.
This also prevents the child network interface from trying to join to
multicast groups until the PKEY is configured. We used to get error
messages like:
ib0.f2f2: couldn't attach QP to multicast group ff12:401b:f2f2:0:0:0:ffff:ffff
in this case. To fix this, I just check IPOIB_FLAG_OPER_UP flag in
ipoib_set_mcast_list().
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the call to mthca_MODIFY_QP() failed, then mthca_modify_qp() would
still do some things it shouldn't, such as store away attributes for
special QPs. Fix this, and simplify the code, by simply jumping to
the exit path if mthca_MODIFY_QP() fails.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
With the current IPoIB driver, the status of network interfaces stays
"RUNNING" even if the link goes down (for example because a cable is
unplugged). Fix this by flushing the IPoIB interface when the link
goes down.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_alloc_sqp() by mthca_set_qp_size() need to set qp->transport
before calling mthca_set_qp_size(), since the value is used there.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When setting the shared receive queue (SRQ) watermark in a modify SRQ
operation, make sure that the supplied value is not larger than the
full size of the SRQ.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Guarantee the calculated work queue entry size does not exceed the max
allowable WQE size when creating an SRQ. This is a problem with Arbel
in Tavor-compatibility mode because the current WQE size computation
method rounds up to next power of 2.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a check that the modify QP parameters sgid_index and path_mtu are
valid, since they might come from userspace.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since the SCSI midlayer is moving towards entirely getting rid of
commands with use_sg == 0, we should treat this case as an exception.
Therefore, change the IB SRP initiator to create a fake scatterlist
for these commands with sg_init_one(). This simplifies the flow of
DMA mapping and unmapping, since SRP can just use dma_map_sg() and
dma_unmap_sg() unconditionally, rather than having to choose between
the dma_{map,unmap}_sg() and dma_{map,unmap}_single() variants.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_ib_dev_flush() should get passed cpriv->dev, not &cpriv->dev.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct neigh_ops currently has a destructor field, which no in-kernel
drivers outside of infiniband use. The infiniband/ulp/ipoib in-tree
driver stashes some info in the neighbour structure (the results of
the second-stage lookup from ARP results to real link-level path), and
it uses neigh->ops->destructor to get a callback so it can clean up
this extra info when a neighbour is freed. We've run into problems
with this: since the destructor is in an ops field that is shared
between neighbours that may belong to different net devices, there's
no way to set/clear it safely.
The following patch moves this field to neigh_parms where it can be
safely set, together with its twin neigh_setup. Two additional
patches in the patch series update ipoib to use this new interface.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix endianness handling of srq_limit: it is big-endian in the context
structure, so we need to swab it before returning it.
Also add support for srq_limit query for Tavor (non-MemFree) HCAs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In neigh_add_path(), the queue of delayed packets can never be full,
because the queue is always freshly created and cannot be found by any
other code path. In fact, the test of the queue length is worse than
useless: if somehow the test ever triggered and path_rec_start() also
failed, then dev_kfree_skb_any() will be called twice on the same skb.
Fix this by deleting the useless test. Pointed out by Michael
S. Tsirkin <mst@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
MemFree devices need to reserve one shared receive queue (SRQ) work
request for internal use, so the capacity returned from the create_srq
and query_srq methods should be srq->max - 1.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix an oopsable race debugged by Eli Cohen <eli@mellanox.co.il>:
After removing the port from port_list, ib_mad_port_close flushes
port_priv->wq before destroying the special QPs. This means that a
completion event could arrive, and queue a new work in this work queue
after flush.
This patch also removes an unnecessary flush_workqueue():
destroy_workqueue() already includes a flush.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix leak found by Coverity: in the SRP_OPT_DGID case,
srp_parse_options() didn't free the result of match_strdup().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix bug found by coverity: the loop body never executed, because it
was doing for (i = 0; i < MTHCA_EQ_CMD; ++i), but MTHCA_EQ_CMD is 0.
The correct loop bound is MTHCA_NUM_EQ, to loop over all EQs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix two bugs found by coverity:
- Memory leak in error path of alloc_group_attrs()
- Fencepost error in state_show(): the test should be < ARRAY_SIZE(),
not <= ARRAY_SIZE().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move ipoib_ib_dev_flush() to ipoib's workqueue. This keeps it ordered
with respect to other work scheduled by the ipoib driver. This fixes
problems with races, for example:
- ipoib_ib_dev_flush() has started running because of an IB event
- user does ifconfig ib0 down
- ipoib_mcast_stop_thread() gets called twice and waits for the same
completion twice
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the IPoIB build (which is broken in net-2.6.17 because of my
screw-up, which left out this chunk in ipoib_multicast.c).
The neighbour destructor is now in neigh_params, so we don't
need to clear it in the ops structure.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The old code incorrectly used the primary P_Key index as the alternate
index too.
Signed-off-by: Ami Perlmutter <amip@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for sending and receiving large RMPP transfers. The old
code supports transfers only as large as a single contiguous kernel
memory allocation. This patch uses linked list of memory buffers when
sending and receiving data to avoid needing contiguous pages for
larger transfers.
Receive side: copy the arriving MADs in chunks instead of coalescing
to one large buffer in kernel space.
Send side: split a multipacket MAD buffer to a list of segments,
(multipacket_list) and send these using a gather list of size 2.
Also, save pointer to last sent segment, and retrieve requested
segments by walking list starting at last sent segment. Finally,
save pointer to last-acked segment. When retrying, retrieve
segments for resending relative to this pointer. When updating last
ack, start at this pointer.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add SCSI host attributes in sysfs that show the ID extension, IOC
GUID, service ID, P_Key and destination GID for each target port that
the SRP initiator connects to.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move checking the state of a cm_id before modifying it when handling a
REP. This fixes a bug seen under MPI scale-up testing, where a NULL
timewait_info pointer is dereferenced if a request times out before a
REP is received.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Sinai (one-port PCI Express) HCAs get improved throughput for messages
bigger than 80 KB in DDR mode if memory keys are formatted in a
specific way. The enhancement only works if the memory key table is
smaller than 2^24 entries. For larger tables, the enhancement is off
and a warning is printed (to avoid silent performance loss).
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The old code didn't convert from the kernel's enum correctly.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
According to the IB spec version 1.2, section 11.2.4.2, the current
table has a couple of mistakes where it allows the current QP state
(IB_QP_CUR_STATE) attribute. For the transitions:
RTS -> RTS: IB_QP_CUR_STATE should be allowed for all transports
SQD -> SQD: IB_QP_CUR_STATE should never be allowed
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_stop_thread currently tests mcast->query and if it is
NULL, does not perform wait_for_completion on the mcast and frees the
mcast object directly.
However, since both operations are done without locking, it is
possible that ipoib_mcast_join_complete is in progress on this mcast
object and has set mcast->query to NULL already.
Solve this by:
- taking priv->lock before we change mcast->query in ipoib_mcast_join_complete,
and keeping it until we no longer need the mcast object
- taking priv->lock around mcast->query test in ipoib_mcast_stop_thread
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If posting receives in ipoib_ib_dev_open() fails, call
ipoib_ib_dev_stop() to move the device's QP back to the RESET state so
that we can try again later.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use a named enum for the HCA's internal page size, rather than having
magic values of 4096 and shifts by 12 all over the code. Also, fix
one minor bug in EQ handling: only one HCA page is mapped to the HCA
during initialization, but a full kernel page is unmapped during
cleanup. This might cause problems when PAGE_SIZE != 4096.
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Check that the alternate P_Key index is in range when setting the
alternate path for a QP. Also make a cosmetic touch up to the debug
message printed when the main P_Key index is out of range.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IB_SEND_FENCE flag in post_send methods.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_send() tests mcast->ah twice. If this value is changed
between these two points, we leak an skb. However,
ipoib_mcast_join_finish() sets mcast->ah with no locking, so it could
race against ipoib_mcast_send().
As a solution, take priv->lock around assignment to mcast->ah thus
making sure ipoib_mcast_send() (which also takes priv->lock) is not in
flight.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Implement query_ah (except for AVs which are in HCA memory). This is
needed to implement RMPP duplicate session detection on sending side
(extraction of DGID/DLID and GRH flag from address handle).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch is checks whether the HCA supports posting FW commands
through a doorbell page (user access region 0, or "UAR0"). If this is
supported, the driver maps UAR0 and uses it for FW commands. This can
be controlled by the value of a writable module parameter
fw_cmd_doorbell. When the parameter is 0, the commands are posted
through HCR using the old method; otherwise if HCA is capable commands
go through UAR0.
This use of UAR0 to post commands eliminates the need for polling the
"go" bit prior to posting a new command. Since reading from a PCI
device is much more expensive then issuing a posted write, it is
expected that issuing FW commands this way will provide better CPU
utilization.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pass actual capacity of created SRQ back to userspace, so that
userspace can report accurate capacities. This requires an ABI bump,
to change struct ib_uverbs_create_srq_resp.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Have mthca's create_srq method return the actual capacity of the SRQ
that gets created. Also update comments in <rdma/ib_verbs.h> to
clarify that this is what is expected from ib_create_srq().
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Cosmetic change: make alignment explicit in to_ipoib_neigh.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace SRQs (shared
receive queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has the underlying
ib_query_srq() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace QPs (queue pairs),
including adding an ABI for marshalling requests and responses. The
kernel midlayer already has the underlying ib_query_qp() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use ib_modify_qp_is_ok() in mthca, and delete the big table of
attributes for queue pair state transitions.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The in-kernel mthca driver contains a table of which attributes are
valid for each queue pair state transition. It turns out that both
other IB drivers -- ipath and ehca -- which are being prepared for
merging have copied this table, errors and all.
To forestall this code duplication, move this table and the code to
check parameters against it into a midlayer library function,
ib_modify_qp_is_ok().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add low-level driver support to ib_mthca so that consumers can request
a "send queue drained" event be generated when a transiton to the SQD
state completes.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The call to ib_get_agent_port() shouldn't be possible to fail when
smi_check_local_dr_smp() is called from ib_mad_recv_done_handler().
When it is called from handle_outgoing_dr_smp(), the device and
port_num come from mad_agent_priv so I assume the call to
ib_get_agent_port() shouldn't fail either. In either case,
smi_check_local_smp() only uses the mad_agent pointer to check that
mad_agent->device->process_mad is not NULL. The device pointer would
have to be the same as the one passed to smi_check_local_dr_smp()
since that pointer is used later instead of the one checked in
smi_check_local_smp().
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
smi_check_local_dr_smp() is called only from two places in core/mad.c
It returns 0 or 1. In smi_check_local_dr_smp(), it checks for
a directed route SMP but this function is only called when the SMP
is a directed route so this is a NOP.
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows the consumer to set the page size of "pages" mapped
by the pool FMRs, which is a feature already existing in the base
verbs API. On the cosmetic side it changes ib_fmr_attr.page_size field
to be named page_shift.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a modify_device method to mthca, which implements setting the node
description. This makes the writable "node_desc" sysfs attribute work
for Mellanox HCAs.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Expose a writable "node_desc" sysfs attribute for InfiniBand devices.
This allows userspace to update the node description with information
such as the node's hostname, so that IB network management software
can tie its view to the real world.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle resizing userspace CQs (completion
queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has ib_resize_cq().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The might_sleep() annotations in mthca are silly -- they all occur
shortly before calls that will end up in core functions like kmalloc()
that will print the same warning in an unsafe context anyway. In
fact, beyond cluttering the source, we're actually bloating text with
CONFIG_DEBUG_SPINLOCK_SLEEP and/or CONFIG_PREEMPT_VOLUNTARY set.
With both options set, getting rid of the might_sleep()s saves a lot:
add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-171 (-171)
function old new delta
mthca_pd_alloc 132 109 -23
mthca_init_cq 969 946 -23
mthca_mr_alloc 592 568 -24
mthca_pd_free 67 42 -25
mthca_free_mr 219 194 -25
mthca_free_cq 570 545 -25
mthca_fmr_alloc 742 716 -26
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The function mthca_free_err_wqe() can never fail, so get rid of its
return value. That means handle_error_cqe() doesn't have to check
what mthca_free_err_wqe() returns, which means it can't fail either
and doesn't have to return anything either. All this results in
simpler source code and a slight object code improvement:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-10 (-10)
function old new delta
mthca_free_err_wqe 83 81 -2
mthca_poll_cq 1758 1750 -8
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Even after the last fix, it's still possible for a send-only join to
start before the join for the broadcast group has finished. This
could cause us to create a multicast group using attributes from the
broadcast group that haven't been initialized yet, so we would use
garbage for the Q_Key, etc. Fix this by waiting until the broadcast
group's attached flag is set before starting send-only joins.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When debugging is enabled, the mthca_QUERY_DEV_LIM() firmware command
function prints out some of the device limits that it queries.
However the debugging prints happen before all of the fields are
extracted from the firmware response, so some of the values that get
printed are uninitialized junk. Move the prints to the end of the
function to fix this.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Further, there's an additional issue that I saw in testing:
ipoib_mcast_send may get called when priv->broadcast is NULL (e.g. if
the device was downed and then upped internally because of a port
event).
If this happends and the send-only join request gets completed before
priv->broadcast is set, we get an oops.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the following race scenario:
- Device is up.
- Port event or set mcast list triggers ipoib_mcast_stop_thread,
this cancels the query and waits on mcast "done" completion.
- Completion is called and "done" is set.
- Meanwhile, ipoib_mcast_send arrives and starts a new query,
re-initializing "done".
Fix this by adding a "multicast started" bit and checking it before
starting a send-only join.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current IB code doesn't work with userspace programs that listen only to
the kernel event netlink socket as it is trying to create its own dev
interface. This small patch fixes this problem, and removes some
unneeded code as the driver core handles this logic for you
automatically.
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix handling of directed route SMPs with a beginning or ending LID
routed part.
Signed-off-by: Ralph Campbell <ralphc@pathscale.com>
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert semaphores to mutexes in mthca. Leave firmware command
interface poll_sem and event_sem as semaphores.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We have run into the following problem: if a task receives a signal
while in the process of e.g. destroying a resource (which could be
because the relevant file was closed) mthca could bail out from trying
to take a command interface semaphore without performing the
appropriate command to tell hardware that the resource is being
destroyed.
As a result we see messages like
ib_mthca 0000:04:00.0: HW2SW_CQ failed (-4)
In this case, hardware could access the resource after the memory has
been freed, possibly causing memory corruption.
A simple solution is to replace down_interruptible() by down() in
command interface activation.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
[ It's also not safe to bail out on multicast table operations, since
they may be invoked on the cleanup path too. So use down() for
mcg_table.sem too. ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert srp_host->target_mutex from a semaphore to a mutex.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There are some cards around that have UAR (user access region) size
different from 8 MB. Relax our sanity check to make sure that the PCI
BAR is big enough to access the UAR size reported by the device
firmware instead.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_create_ah() includes the port number in the GID index. The reverse
needs to be done in mthca_read_ah().
Noted by Hal Rosenstock.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Avoid corrupting mcast->pkt_queue by serializing access with
priv->tx_lock. Also, update dropped packet statistics to count
multicast packets removed from pkt_queue as dropped.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
sa_query schedules work on IB asynchronous events. After
unregistering the async event handler, make sure that this work has
completed before releasing the IB device (and possibly allowing the
sa_query module text to go away).
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
uverbs might schedule work to clean up when a file is closed. Make
sure that this work runs before allowing module text to go away.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The SA path record query completion can initialize path->pathrec.dlid
before IPoIB's callback runs and initializes path->ah, so we must test
ah rather than dlid.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove the "inline" keyword from a bunch of big functions in the kernel with
the goal of shrinking it by 30kb to 40kb
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
semaphore to mutex conversion by Ingo and Arjan's script.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Sanity-checked on real IB hardware ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
build_mlx_header() was using sqp->ud_header.grh_present before it was
initialized by mthca_read_ah(). Furthermore, header->grh_present is
set by ib_ud_header_init, so there's no need to set it again in
mthca_read_ah().
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use the ALIGN macro to simplify some rounding code.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix memory leaks in mthca_create_qp() and mthca_create_srq()
error handling.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert "/ (1 << lg)" to ">> lg" for a slight code size reduction.
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-24 (-24)
function old new delta
mthca_map_cmd 613 589 -24
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current handling of multicast groups in IPoIB ends up never
freeing send-only multicast groups. It turns out the logic was much
more complicated than it needed to be; we can fix this bug and
completely kill ipoib_mcast_dev_down() at the same time.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
dev->mc_list accesses must be protected by dev->xmit_lock.
Found by Eli Cohen <eli@mellanox.co.il>.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Multiple ipoib_neigh structures on mcast->neigh_list may point to the
same ah. This means that ipoib_mcast_free() can't just make a list of
ah structs to free, since this might end up trying to add the same ah
to the list more than once. Handle this in ipoib_multicast.c in the
same way as it is handled in ipoib_main.c for struct ipoib_path.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't leak memory on allocation failure for broadcast mcast group.
Also, print a warning to match handling for other mcast groups.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a node_guid field to struct ib_device. It is the responsibility
of the low-level driver to initialize this field before registering a
device with the midlayer. Convert everyone to looking at this field
instead of calling ib_query_device() when all they want is the node
GUID, and remove the node_guid field from struct ib_device_attr.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor out common code for initializing MAD packets, which is shared
by many query routines in mthca_provider.c, into init_query_mad().
add/remove: 1/0 grow/shrink: 0/4 up/down: 16/-44 (-28)
function old new delta
init_query_mad - 16 +16
mthca_query_port 521 518 -3
mthca_query_pkey 301 294 -7
mthca_query_device 648 641 -7
mthca_query_gid 453 426 -27
Signed-off-by: Roland Dreier <rolandd@cisco.com>
I am seeing EQ overruns in SDP stress tests: if the CQ completion
handler arms a CQ, this could generate more EQEs, so that EQ will
never get empty and consumer index will never get updated.
This is similiar to what we have with command interface:
/*
* cmd_event() may add more commands.
* The card will think the queue has overflowed if
* we don't tell it we've been processing events.
*/
However, for completion events, we *don't* want to update the consumer
index on each event. So, perform EQ doorbell coalescing: allocate EQs
with some spare EQEs, and update once we run out of them.
The value 0x80 was selected to avoid any performance impact.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For all pages except possibly the last one, the byte beyond the buffer
end must be page aligned. Therefore, when computing the page shift to
use, OR the end addresses of the buffers as well as the start
addresses into the mask we check.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Include fixes for 2.6.14-git11. Should allow to remove sched.h from
module.h on i386, x86_64, arm, ia64, ppc, ppc64, and s390. Probably more
to come since I haven't yet checked the other archs.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ib_create_ah_from_wc() doesn't create the correct return address (AH)
when there is a GRH present (source & dest GIDs need to be swapped).
Signed-off-by: Ralph Campbell <ralphc@pathscale.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_uverbs_create_cq() should release the completion channel event file
if an error occurs after it looks it up. Also, if userspace asks for
a completion channel and we don't find it, an error should be returned
instead of silently creating a CQ without a completion channel.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If an operation fails after incrementing an object's reference count,
then it should decrement the reference count on the error path.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add code to modify QP operation to handle setting alternate paths for
connected QPs.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fill vendor_err field in completion with error.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Multicast group management fixes:
. Fix leak of mailbox memory in error handling on multicast group operations.
. Free AMGM indices at detach and in attach error handling.
. Fix amount to shift for aligning next_gid_index in mailbox: it
starts at bit 6, not bit 5.
. Allocate AMGM index after end of MGM table, in the range num_mgms to
multicast table size - 1. Add some BUG_ON checks to catch cases
where the index falls in the MGM hash area.
. Initialize the list of QPs in a newly-allocated group from AMGM to 0
This is necessary since when a group is moved from AMGM to MGM (in the
case where the MGM entry has been emptied of QPs), the AMGM entry is
not reset to 0 (and we don't want an extra command to do that).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
PKEY_INDEX is not a legal parameter in the RTR->RTS transition.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fixes to SQEr->RTS transition in modify_qp:
1. The flag IB_QP_ACCESS_FLAGS is optional for UC qps
2. The SQEr state is not supported for RC qps
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a case where copying max_inline_data from a successful create_qp
capabilities output to create_qp input could cause EINVAL error:
mthca_set_qp_size must check max_inline_data directly against
max_desc_sz; checking qp->sq.max_gs is wrong since max_inline_data
depends on the qp type and does not involve max_sg.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix mthca_create_eq for when the EQ size is not a power of 2.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Modify_qp should check that the physical port number provided
is a legal value.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Check error return on call to mthca_dev_lim for Tavor
(as is done for memfree).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Leave the overloaded "hotplug" word to susbsystems which are handling
real devices. The driver core does not "plug" anything, it just exports
the state to userspace and generates events.
Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Thinko: 64 bytes is the minimum SRQ WQE size (not the maximum).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.
Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sae and sre bits should only be set when setting sra_max. Further, in
the old code, if the caller specifies max_rd_atomic = 0, the sre and
sae bits are still set, with the result that the QP ends up with
max_rd_atomic = 1 in effect.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch corrects some corner cases in managing the RAE/RRE bits in
the mthca qp context. These bits need to be zero if the user requests
max_dest_rd_atomic of zero. The bits need to be restored to the value
implied by the qp access flags attribute in a previous (or the
current) modify-qp command if the dest_rd_atomic variable is changed
to non-zero.
In the current implementation, the following scenario will not work:
RESET-to-INIT set QP access flags to all disabled (zeroes)
INIT-to-RTR set max_dest_rd_atomic=10, AND
set qp_access_flags = IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_ATOMIC
The current code will incorrectly take the access-flags value set in
the RESET-to-INIT transition.
We can simplify, and correct, this IB_QP_ACCESS_FLAGS handling: it is
always safe to set qp access flags in the firmware command if either
of IB_QP_MAX_DEST_RD_ATOMIC or IB_QP_ACCESS_FLAGS is set, so let's
just set it to the correct value, always.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When cleaning up a CQ for a QP attached to SRQ, need to free an SRQ
WQE only if the CQE is a receive completion.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
break only escapes from the innermost loop, and we want to escape both
loops and return an answer. Noticed by Ishai Rabinovitch.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Only change the driver's copy of the QP attributes in modify QP after
checking the modify QP command completed successfully.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix thinko in rd_atomic calculation: ffs(x) - 1 does not find the next
power of 2 -- it should be fls(x - 1).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add limit checking on rd_atomic and dest_rd_atomic attributes:
especially for max_dest_rd_atomic, a value that is larger than HCA
capability can cause RDB overflow and corruption of another QP.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Free the memory allocated in mthca_init_user_db_tab() when releasing
the db_tab in mthca_cleanup_user_db_tab().
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't leak packet if it had a timeout, and don't leak timeout struct
if queue_packet() fails.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use an increasing local ID to avoid re-using identifiers while
messages may still be outstanding on the old ID. Without this, a
quick connect-disconnect-connect sequence can fail by matching
messages for the new connection with the old connection.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change reject code from TIMEOUT to CONSUMER_REJECT when destroying a
cm_id in the process of connecting.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Unlike tavor, the max work queue size is an exact power of 2 for arbel
mode, despite what the documentation (of the QUERY_DEV_LIM firmware
command) says. Without this patch, on Arbel, we can start with a QP
of a valid size and get above the reported limit after rounding to the
next power of two.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
uverbs needs to track which multicast groups is each qp
attached to, in order to properly detach when cleanup
is performed on device file close.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
On mem-free HCAs, when posting a long list of send requests, a
doorbell must be rung every 255 requests. Add code to handle this.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If ipoib_ib_dev_up() fails after ipoib_ib_dev_open() is called, then
ipoib_ib_dev_stop() needs to be called to clean up.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
race condition: ipoib_ib_dev_flush is accessing child list without locks.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_alloc() uses kzalloc(), so there's no need to zero out
members of the mcast struct after it's allocated.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make sure mcast->done is initialized to uncompleted value before we
submit a new query, so that it's safe to wait on.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Always set path->query to NULL when the SA path record query
completes, rather than only when we don't have an address handle.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
It's possible that IPoIB will issue multiple SA queries for the same
path struct. Therefore the struct's completion needs to be
initialized for each query rather than only once when the struct is
allocated, or else we might not wait long enough for later queries to
finish and free the path struct too soon.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_umad_write in user_mad.c is looking at rmpp_hdr field in MAD before
checking that the MAD actually has the RMPP header. So for a MAD
without RMPP header it looks like we are actually checking a bit
inside M_Key, or something.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
last pointer is not updated when QP is modified to reset state. This
causes data corruption if WQEs are already posted on the queue.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The Coverity checker spotted this obvious use-after-release bug caused
by a wrong order of the cleanups.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sure that userspace passes in enough data when sending a MAD. We
always copy at least sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR
bytes from userspace, so anything less is definitely invalid. Also,
if the length is less than this limit, it's possible for the second
copy_from_user() to get a negative length and trigger a BUG().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Calculation of QP capabilities still isn't exactly right in mthca:
max_send_sge/max_recv_sge fields returned in create_qp can exceed the
handware supported limits.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Responder resources are only required to handle RDMA reads and atomic
operations, not RDMA writes. So the driver should allow RDMA writes
even if responder resources are set to 0. This is especially
important for the UC transport -- with the old code, it was impossible
to enable RDMA writes for UC QPs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Have __srp_get_tx_iu() fail if the target port's request limit will
not allow the initiator to post a send. This avoids continuing on and
posting a receive, and then failing to post a corresponding send. If
that happens, then the initiator will end up with an extra receive
posted, and if this happens to much, the receive queue will overflow.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Increase SRP max_luns to 512 to match the kernel's default, since SRP
storage targets can have lots of LUNs and the SRP initiator itself
doesn't have any particular limit.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The previous umad deadlock fix left ib_umad_kill_port() still
vulnerable to deadlocking. This patch fixes that by downgrading our
lock to a read lock when we might end up trying to reacquire the lock
for reading.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In Tavor mode, when posting a long list of receive work requests, a
doorbell must be rung every 256 requests. Add code to do this when
required.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Handle case where prod_index has wrapped around and become less than
cq->cons_index by checking that their difference as a signed int is
positive rather than comparing directly.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The size of work requests for atomic operations was computed
incorrectly in mthca: all sizeofs need to be divided by 16.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the computation of QP capabilities (max scatter/gather entries,
max inline data, etc) into the kernel, and have the uverbs module
return the values as part of the create QP response. This keeps
precise knowledge of device limits in the low-level kernel driver.
This requires an ABI bump, so while we're making changes, get rid of
the max_sge parameter for the modify SRQ command -- it's not used and
shouldn't be there.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Now that ib_umad uses the new MAD sending interface, it no longer
needs its own L_Key. So just delete the array of MRs that it keeps.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the struct ib_device.resize_cq() method to take a plain integer
that holds the new CQ size, rather than a pointer to an integer that
it uses to return the new size. This makes the interface match the
exported ib_resize_cq() signature, and allows the low-level driver to
update the CQ size with proper locking if necessary.
No in-tree drivers are exporting this method yet.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a typo in the rearming of the catastrophic error polling timer: we
should rearm the timer as long as the stop flag is _not_ set.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For cut-and-paste reasons, the IPoIB driver was setting skb->dev right
before calling dev_kfree_skb_any(). Get rid of this.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_unregister_mad_agent() completes all pending MAD sends and waits
for the agent's send_handler routine to return. umad's send_handler()
calls queue_packet(), which does down_read() on the port mutex to look
up the agent ID. This means that the port mutex cannot be held for
writing while calling ib_unregister_mad_agent(), or else it will
deadlock. This patch fixes all the calls to ib_unregister_mad_agent()
in the umad module to avoid this deadlock.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add ibX_path files to debugfs that contain information about the IPoIB
path cache. IPoIB ARP only gives GIDs, which the IPoIB driver must
resolve to real IB paths through the ib_sa module. For debugging,
when the ARP table looks OK but traffic isn't flowing, it's useful to
be able to see if the resolution from GID to path worked.
Also clean up the formatting of the existing _mcg debugfs files.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch removes almost all inclusions of linux/version.h. The 3
#defines are unused in most of the touched files.
A few drivers use the simple KERNEL_VERSION(a,b,c) macro, which is
unfortunatly in linux/version.h.
There are also lots of #ifdef for long obsolete kernels, this was not
touched. In a few places, the linux/version.h include was move to where
the LINUX_VERSION_CODE was used.
quilt vi `find * -type f -name "*.[ch]"|xargs grep -El '(UTS_RELEASE|LINUX_VERSION_CODE|KERNEL_VERSION|linux/version.h)'|grep -Ev '(/(boot|coda|drm)/|~$)'`
search pattern:
/UTS_RELEASE\|LINUX_VERSION_CODE\|KERNEL_VERSION\|linux\/\(utsname\|version\).h
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the remaining misc drivers/ part of the big kfree cleanup patch.
Remove pointless checks for NULL prior to calling kfree() in misc files in
drivers/.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Aristeu Sergio Rozanski Filho <aris@cathedrallabs.org>
Acked-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Pierre Ossman <drzeus@drzeus.cx>
Acked-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Len Brown <len.brown@intel.com>
Acked-by: "Antonino A. Daplas" <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch. This should now allow not to include sched.h
from module.h, which is done by a followup patch.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Two small fixes for the umad module:
- set kobject name for issm device properly
- in ib_umad_add_one(), s is subtracted from the index i when
initializing ports, so s should be subtracted from the index when
freeing ports in the error path as well.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Report the device's real page size capability in mthca_query_device().
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make sure that the P_Key index passed into mthca_modify_qp() is
within the device's P_Key table.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix hotplug of devices for ib_umad module: when a device goes away,
kill off all MAD agents for open files associated with that device,
and make sure that the device is not touched again after ib_umad
returns from its remove_one function.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Mellanox has decided that the components of the firmware version are
really meant to be displayed in decimal, e.g. 0x000400070190 is
version 4.7.400. Change the format we use from "%x.%x.%x" to
"%d.%d.%d" to match this convention.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't build ipoib_mcast_iter_ functions if CONFIG_INFINIBAND_IPOIB_DEBUG
is not enabled -- their only callers will not be built either.
Also move the prototype for ipoib_open() to ipoib.h to fix a sparse warning.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an InfiniBand SCSI RDMA Protocol (SRP) initiator. This driver is
used to talk talk to InfiniBand SRP targets (storage devices).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Shrink our source and .text a little by removing a few assignments of
NULL and 0 to memory that is already cleared as part of the allocation.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace kmalloc()+memset(,0,) with kzalloc(), for a net savings of 35
source lines and about 500 bytes of text.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Minor cleanups: fix a misleading comment, and get rid of attr_mask
variables that are only used to hold constants (just use the constants
directly).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix wqe_to_link() to use a structure field that we know is definitely
always unused for receive work requests, so that it really avoids the
free list corruption bug that the comment claims it does.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Userspace CQs that have no completion event channel attached end up
with their cq_context set to NULL. However, asynchronous events like
"CQ overrun" can still occur on such CQs, so add a uverbs_file member
to struct ib_ucq_object that we can follow to deliver these events.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
I recently picked up my older work to remove unnecessary #includes of
sched.h, starting from a patch by Dave Jones to not include sched.h
from module.h. This reduces the number of indirect includes of sched.h
by ~300. Another ~400 pointless direct includes can be removed after
this disentangling (patch to follow later).
However, quite a few indirect includes need to be fixed up for this.
In order to feed the patches through -mm with as little disturbance as
possible, I've split out the fixes I accumulated up to now (complete for
i386 and x86_64, more archs to follow later) and post them before the real
patch. This way this large part of the patch is kept simple with only
adding #includes, and all hunks are independent of each other. So if any
hunk rejects or gets in the way of other patches, just drop it. My scripts
will pick it up again in the next round.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use spin_trylock_irqsave() in ipoib_start_xmit() instead of
reinventing it out of local_irq_save(), spin_trylock() and
local_irq_restore().
Signed-off-by: Roland Dreier <rolandd@cisco.com>