IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up. It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue, we can
have one device in the middle of a down and holding the rtnl lock and
another totally unrelated device needing to run mcast_restart_task,
which wants the rtnl lock and will loop trying to take it unless is
sees its own FLAG_ADMIN_UP flag go away. Because the unrelated
interface will never see its own ADMIN_UP flag drop, the interface
going down will deadlock trying to flush the queue. There are several
possible solutions to this problem:
Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail. This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.
Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail. Again, this can
drop work on the floor. I suppose if our own ADMIN_UP flag doesn't go
away, then maybe after a few tries on the rtnl lock we can queue our
own task back up as a delayed work and return and avoid dropping work
on the floor that way. But I'm not 100% convinced that we won't cause
other problems.
Or the method this patch attempts to use, which is when we bring an
interface up, create a workqueue specifically for that interface, so
that when we take it back down, we are flushing only those tasks
associated with our interface. In addition, keep the global
workqueue, but now limit it to only flush tasks. In this way, the
flush tasks can always flush the device specific work queues without
having deadlock issues.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly
still on it.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Our mcast_dev_flush routine and our mcast_restart_task can race
against each other. In particular, they both hold the priv->lock
while manipulating the rbtree and while removing mcast entries from
the multicast_list and while adding entries to the remove_list, but
they also both drop their locks prior to doing the actual removes.
The mcast_dev_flush routine is run entirely under the rtnl lock and so
has at least some locking. The actual race condition is like this:
Thread 1 Thread 2
ifconfig ib0 up
start multicast join for broadcast
multicast join completes for broadcast
start to add more multicast joins
call mcast_restart_task to add new entries
ifconfig ib0 down
mcast_dev_flush
mcast_leave(mcast A)
mcast_leave(mcast A)
As mcast_leave calls ib_sa_multicast_leave, and as member in
core/multicast.c is ref counted, we run into an unbalanced refcount
issue. To avoid stomping on each others removes, take the rtnl lock
specifically when we are deleting the entries from the remove list.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used. We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it. This made it less
than totally effective, and certainly made its use confusing. And in
the flush function we would use the presence of this flag to signal
that we should wait on the completion struct, but we never cleared
this flag, ever. This is further muddied by the fact that we overload
the MCAST_FLAG_BUSY flag to mean two different things: we have a join
in flight, and we have succeeded in getting an ib_sa_join_multicast.
In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.
1) Remove the MCAST_JOIN_STARTED flag entirely
2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test on mcast->mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear
the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also
take the mutex in our join callback. This forces
ib_sa_multicast_join to return and set mcast->mc before we process
the callback. This way, our callback can safely clear mcast->mc
if there is an error on the join and we will do the right thing as
a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no
longer call mcast_sendonly_join directly from mcast_send and
instead must add sendonly join processing to the mcast_join_task
A number of different races are resolved with these changes. These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.
One race looks something like this:
Thread 1 Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
alloc member
call callback
ifconfig ib0 down
wait_for_completion
callback call completes
wait_for_completion in
mcast_dev_flush completes
mcast->mc is PTR_ERR_OR_NULL
so we skip ib_sa_leave_multicast
return from callback
return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast
We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c
Another like this:
Thread 1 Thread 2 Thread 3
ib_sa_multicast_join
ifconfig ib0 down
priv->broadcast = NULL
join_complete
wait_for_completion
mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
complete
return -EAGAIN (making mcast->mc invalid)
call ib_sa_multicast_leave
on invalid mcast->mc, hang
forever
By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback. This allows us to clear mcast->mc if there is an
error and the join is going to fail. We do this before we complete
the mcast. In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface. Unfortunately, that's not
the case. In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task,
the flush_workqueue waits on this thread too. But this thread is
deadlocked on the rtnl lock. The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_ADMIN_UP has been cleared, in which case we don't need to do
anything anyway and we just return.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Setting the MTU can safely be moved to the carrier_on_task, which keeps
us from needing to take the rtnl lock in the join_finish section.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
CC [M] drivers/infiniband/ulp/isert/ib_isert.o
drivers/infiniband/ulp/isert/ib_isert.c: In function ‘isert_cq_comp_err’:
drivers/infiniband/ulp/isert/ib_isert.c:1979:42: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
- Fall-through in switch case instead in do_control_comp.
- Move rkey invalidation to a function.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
debug_level 1 (warn): Include warning messages.
debug_level 2 (info): Include relevant info for control plane.
debug_level 3 (debug): Include relevant info in the IO path.
Also, added/removed some logging messages.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Personal preference, easier control of the log level with
a single modparam which can be changed dynamically. Allows
better saparation of control and IO plains.
Replaced throughout ib_isert.c:
s/pr_debug/isert_dbg/g
s/pr_info/isert_info/g
s/pr_warn/isert_warn/g
s/pr_err/isert_err/g
Plus nit checkpatch warning change.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
We don't want to wait for conn_logout_comp from isert_comp_wq
context as this blocks further completions from being processed.
Instead we wait for it conditionally (if logout response was
actually posted) in wait_conn. This wait should normally happen
immediately as it occurs after we consumed all the completions
(including flush errors) and conn_logout_comp should have been
completed.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Might result in a deadlock where completion context waits for
session commands release where the later might need a final
completion for it.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In order to reduce the contention on CQ locking (present
in some LLDDs) we poll in batches of 16 work completion items.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In case the CQ is packed with completions, we can't just
hog the CPU forever. Poll until a sufficient budget (currently
hard-coded to 64k completions) and if budget is exhausted, bailout
and give a chance to other threads.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In order to know that we consumed all the connection completions
we maintain atomic post_send_buf_count for each IO post send. But
we can know that if we post a "beacon" (zero length RECV work request)
after we move the QP into error state and the target does not serve
any new IO. When we consume it, we know we finished all the connection
completion and we can go ahead and destroy stuff.
In error completion handler we now just need to check for ISERT_BEACON_WRID
to arrive and then wait for session commands to cleanup and complete
conn_wait_comp_err.
We reserve another CQ and QP entries to fit the zero length post recv.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
We are calling session reinstatement, wait_conn will start
connection termination.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Using TX and RX CQs attached to the same vector might
create a throttling effect coming from the serial processing
of a work-queue. Use one CQ instead, it will do better in interrupt
processing and it provides a simpler code. Also, We get rid of
redundant isert_rx_wq.
Next we can remove the atomic post_send_buf_count from the IO path.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
A pre-step before going to a single CQ.
Also this makes the code a little more simple to
read.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Nit, uintptr_t is designed for pointer casting, use it.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
As a pre-step to a single CQ, we unite the error completion
handlers to a single handler.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
It is disabled at the moment, we will get that back
in once the target is more stable.
This reverts commit 95b60f0
"Add support for completion interrupt coalescing"
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Currently we have no way to tell that the target stack is in shutdown
sequence. In case we have open connections, the initiator immediately
attempts to reconnect in a DDOS attack style, so we may end up
terminating the iser enabled network portal while it's np_accept_list
still have pending connections.
The workaround is simply release all the connections in the list.
A proper fix will be to start shutdown sequence by shutting the
network portal to avoid initiator immediate reconnect attempts.
But the temporary work around seems to work at this point, so I think
we can do this for now...
Reported-by: Slava Shwartsman <valyushash@gmail.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
iSER will report supported protection operations based on
the tpg attribute t10_pi settings and HCA PI offload capabilities.
If the HCA does not support PI offload or tpg attribute t10_pi is
not set, we fall to SW PI mode.
In order to do that, we move iscsit_get_sup_prot_ops after connection
tpg assignment.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Fallback to software mode DIF if HCA does not support
PI (without crashing obviously). It is still possible to
run with backend protection and an unprotected frontend,
so looking at the command prot_op is not enough. Check
device PI capability on a per-IO basis (isert_prot_cmd
inline static) to determine if we need to handle protection
information.
Trace:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffffa037f8b1>] isert_reg_sig_mr+0x351/0x3b0 [ib_isert]
Call Trace:
[<ffffffff812b003a>] ? swiotlb_map_sg_attrs+0x7a/0x130
[<ffffffffa038184d>] isert_reg_rdma+0x2fd/0x370 [ib_isert]
[<ffffffff8108f2ec>] ? idle_balance+0x6c/0x2c0
[<ffffffffa0382b68>] isert_put_datain+0x68/0x210 [ib_isert]
[<ffffffffa02acf5b>] lio_queue_data_in+0x2b/0x30 [iscsi_target_mod]
[<ffffffffa02306eb>] target_complete_ok_work+0x21b/0x310 [target_core_mod]
[<ffffffff8106ece2>] process_one_work+0x182/0x3b0
[<ffffffff8106fda0>] worker_thread+0x120/0x3c0
[<ffffffff8106fc80>] ? maybe_create_worker+0x190/0x190
[<ffffffff8107594e>] kthread+0xce/0xf0
[<ffffffff81075880>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff8159a22c>] ret_from_fork+0x7c/0xb0
[<ffffffff81075880>] ? kthread_freezable_should_stop+0x70/0x70
Reported-by: Slava Shwartsman <valyushash@gmail.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This patch converts to allocate PI contexts dynamically in order
avoid a potentially bogus np->tpg_np and associated NULL pointer
dereference in isert_connect_request() during iser-target endpoint
shutdown with multiple network portals.
Also, there is really no need to allocate these at connection
establishment since it is not guaranteed that all the IOs on
that connection will be to a PI formatted device.
We can do it in a lazy fashion so the initial burst will have a
transient slow down, but very fast all IOs will allocate a PI
context.
Squashed:
iser-target: Centralize PI context handling code
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In situations such as bond failover, The new session establishment
implicitly invokes the termination of the old connection.
So, we don't want to wait for the old connection wait_conn to completely
terminate before we accept the new connection and post a login response.
The solution is to deffer the comp_wait completion and the conn_put to
a work so wait_conn will effectively be non-blocking (flush errors are
assumed to come very fast).
We allocate isert_release_wq with WQ_UNBOUND and WQ_UNBOUND_MAX_ACTIVE
to spread the concurrency of release works.
Reported-by: Slava Shwartsman <valyushash@gmail.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
The np listener cm_id will also get ADDR_CHANGE event
upcall (in case it is bound to a specific IP). Handle
it correctly by creating a new cm_id and implicitly
destroy the old one.
Since this is the second event a listener np cm_id may
encounter, we move the np cm_id event handling to a
routine.
Squashed:
iser-target: Move cma_id setup to a function
Reported-by: Slava Shwartsman <valyushash@gmail.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Take isert_conn pointer from cm_id->qp->qp_context. This
will allow us to know that the cm_id context is always
the network portal. This will make the cm_id event check
(connection or network portal) more reliable.
In order to avoid a NULL dereference in cma_id->qp->qp_context
we destroy the qp after we destroy the cm_id (and make the
dereference safe). session stablishment/teardown sequences
can happen in parallel, we should take into account that
connected_handler might race with connection teardown flow.
Also, protect isert_conn->conn_device->active_qps decrement
within the error patch during QP creation failure and the
normal teardown path in isert_connect_release().
Squashed:
iser-target: Decrement completion context active_qps in error flow
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
There is no point in accepting a new CM request only
when we are completely done with the last iscsi login.
Instead we accept immediately, this will also cause the
CM connection to reach connected state and the initiator
is allowed to send the first login. We mark that we got
the initial login and let iscsi layer pick it up when it
gets there.
This reduces the parallel login sequence by a factor of
more then 4 (and more for multi-login) and also prevents
the initiator (who does all logins in parallel) from
giving up on login timeout expiration.
In order to support multiple login requests sequence (CHAP)
we call isert_rx_login_req from isert_rx_completion insead
of letting isert_get_login_rx call it.
Squashed:
iser-target: Use kref_get_unless_zero in connected_handler
iser-target: Acquire conn_mutex when changing connection state
iser-target: Reject connect request in failure path
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
ISER_CONN_UP state is not sufficient to know if
we should wait for completion of flush errors and
disconnected_handler event.
Instead, split it to 2 states:
- ISER_CONN_UP: Got to CM connected phase, This state
indicates that we need to wait for a CM disconnect
event before going to teardown.
- ISER_CONN_FULL_FEATURE: Got to full feature phase
after we posted login response, This state indicates
that we posted recv buffers and we need to wait for
flush completions before going to teardown.
Also avoid deffering disconnected handler to a work,
and handle it within disconnected handler.
More work here is needed to handle DEVICE_REMOVAL event
correctly (cleanup all resources).
Squashed:
iser-target: Don't deffer disconnected handler to a work
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Since commit 0fc4ea701fcf ("Target/iser: Don't put isert_conn inside
disconnected handler") we put the conn kref in isert_wait_conn, so we
need .wait_conn to be invoked also in the error path.
Introduce call to isert_conn_terminate (called under lock)
which transitions the connection state to TERMINATING and calls
rdma_disconnect. If the state is already teminating, just bail
out back (temination started).
Also, make sure to destroy the connection when getting a connect
error event if didn't get to connected (state UP). Same for the
handling of REJECTED and UNREACHABLE cma events.
Squashed:
iscsi-target: Add call to wait_conn in establishment error flow
Reported-by: Slava Shwartsman <valyushash@gmail.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Conflicts:
drivers/scsi/scsi_debug.c
Agreed and tested resolution to a merge problem between a fix in scsi_debug
and a driver update
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
For SPI drivers use the message definitions from scsi.h, and for target
drivers introduce a new TCM_*_TAG namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com
Since we got rid of ordered tag support in 2010 the prime use case of
switching on and off ordered tags has been obsolete. The other function
of enabling/disabling tagging entirely has only been correctly implemented
by the 53c700 driver and isn't generally useful.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Drop the now unused reason argument from the ->change_queue_depth method.
Also add a return value to scsi_adjust_queue_depth, and rename it to
scsi_change_queue_depth now that it can be used as the default
->change_queue_depth implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
We won't ever queue more commands than the host allows. Instead of
letting drivers either reject or ignore this case handle it in
common code. Note that various driver use internal constant or
variables that are assigned to both shost->can_queue and checked
in ->change_queue_depth - I did remove those checks as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
All drivers use the implementation for ramping the queue up and down, so
instead of overloading the change_queue_depth method call the
implementation diretly if the driver opts into it by setting the
track_queue_depth flag in the host template.
Note that a few drivers validated the new queue depth in their
change_queue_depth method, but as we never go over the queue depth
set during slave_configure or the sysfs file this isn't nessecary
and can safely be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Venkatesh Srinivas <venkateshs@google.com>
isert has an issue of trying to create a CQ with more CQEs than are
supported by the hardware, that currently results in failures during
isert_device creation during first session login.
This is the isert version of the patch that Minh Tran submitted for
iser, and is simple a workaround required to function with existing
ocrdma hardware.
Signed-off-by: Chris Moore <chris.moore@emulex.com>
Reviewied-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
At least LID reassignment can trigger a race condition in the SRP
initiator driver, namely the receive completion handler trying to
post a request on a QP during or after QP destruction and before
the CQ's have been destroyed. Avoid this race by modifying a QP
into the error state and by waiting until all receive completions
have been processed before destroying a QP.
Reported-by: Max Gurtuvoy <maxg@mellanox.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Improve performance by using multiple RDMA/RC channels per SCSI
host for communication with an SRP target. About the
implementation:
- Introduce a loop over all channels in the code that uses
target->ch.
- Set the SRP_MULTICHAN_MULTI flag during login for the creation
of the second and subsequent channels.
- RDMA completion vectors are chosen such that RDMA completion
interrupts are handled by the CPU socket that submitted the I/O
request. As one can see in this patch it has been assumed if a
system contains n CPU sockets and m RDMA completion vectors
have been assigned to an RDMA HCA that IRQ affinity has been
configured such that completion vectors [i*m/n..(i+1)*m/n) are
bound to CPU socket i with 0 <= i < n.
- Modify srp_free_ch_ib() and srp_free_req_data() such that it
becomes safe to invoke these functions after the corresponding
allocation function failed.
- Add a ch_count sysfs attribute per target port.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Since the block layer already contains functionality to assign
a tag to each request, use that functionality instead of
reimplementing that functionality in the SRP initiator driver.
This change makes the free_reqs list superfluous. Hence remove
that list.
[hch: updated to use .use_blk_tags instead scsi_activate_tcq]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Changes in this patch:
- Move channel variables into a new structure (struct srp_rdma_ch).
- Add an srp_target_port pointer, 'lock' and 'comp_vector' members
in struct srp_rdma_ch.
- Add code to initialize these three new member variables.
- Many boring "target->" into "ch->" changes.
- The cm_id and completion handler context pointers are now of type
srp_rdma_ch * instead of srp_target_port *.
- Three kzalloc(a * b, f) calls have been changed into kcalloc(a, b, f)
to avoid that this patch would trigger a checkpatch warning.
- Two casts from u64 into unsigned long long have been left out
because these are superfluous. Since considerable time u64 is
defined as unsigned long long for all architectures supported by
the Linux kernel.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Introduce the srp_target_port member variables 'sgid' and 'pkey'.
Change the type of 'orig_dgid' from __be16[8] into union ib_gid.
This patch does not change any functionality but makes the
"Separate target and channel variables" patch easier to verify.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If a cable is pulled during LUN scanning it can happen that the
SRP rport and the SCSI host have been created but no LUNs have been
added to the SCSI host. Since multipathd only sends SCSI commands
to a SCSI target if one or more SCSI devices are present and since
there is no keepalive mechanism for IB queue pairs this means that
after a LUN scan failed and after a reconnect has succeeded no
data will be sent over the QP and hence that a subsequent cable
pull will not be detected. Avoid this by not creating an rport or
SCSI host if a cable is pulled during a SCSI LUN scan.
Note: so far the above behavior has only been observed with the
kernel module parameter ch_count set to a value >= 2.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Attempting to connect three times may be insufficient after an
initiator system tries to relogin, especially if the relogin
attempt occurs before the SRP target service ID has been
registered. Since the srp_daemon retries a failed login attempt
anyway, remove the stale connection retry mechanism.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The patch that adds multichannel support into the SRP initiator
driver introduces an additional call to srp_free_ch_ib(). This
patch helps to keep that later patch simple.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove the tagged argument from scsi_adjust_queue_depth, and just let it
handle the queue depth. For most drivers those two are fairly separate,
given that most modern drivers don't care about the SCSI "tagged" status
of a command at all, and many old drivers allow queuing of multiple
untagged commands in the driver.
Instead we start out with the ->simple_tags flag set before calling
->slave_configure, which is how all drivers actually looking at
->simple_tags except for one worke anyway. The one other case looks
broken, but I've kept the behavior as-is for now.
Except for that we only change ->simple_tags from the ->change_queue_type,
and when rejecting a tag message in a single driver, so keeping this
churn out of scsi_adjust_queue_depth is a clear win.
Now that the usage of scsi_adjust_queue_depth is more obvious we can
also remove all the trivial instances in ->slave_alloc or ->slave_configure
that just set it to the cmd_per_lun default.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Most drivers use exactly the same implementation, so provide it as a
library function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
In this case the cm_id->context is the isert_np, and the cm_id->qp
is NULL, so use that to distinct the cases.
Since we don't expect any other events on this cm_id we can
just return -1 for explicit termination of the cm_id by the
cma layer.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>