Refactor the call to nvme_map_cmb, and change the conditions for probing
for the CMB. First remove the version check as NVMe TPs always apply
to earlier versions of the spec as well. Second check for the whole CMBSZ
register for support of the CMB feature instead of just the size field
inside of it to simplify the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
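The second change in the CMB commit above can be illustrated with a small user-space sketch. The helper names and the simplified register handling are assumptions made for illustration (size field taken to be bits 31:12 of CMBSZ); this is not the driver code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for illustration: size field in bits 31:12 of CMBSZ. */
#define SKETCH_CMBSZ_SZ_SHIFT 12
#define SKETCH_CMBSZ_SZ_MASK  0xfffffu

/* Narrow check: extract the size field and test only that. */
bool sketch_cmb_supported_by_size(uint32_t cmbsz)
{
	return ((cmbsz >> SKETCH_CMBSZ_SZ_SHIFT) & SKETCH_CMBSZ_SZ_MASK) != 0;
}

/* Simplified check: a register that reads as zero means no CMB. */
bool sketch_cmb_supported(uint32_t cmbsz)
{
	return cmbsz != 0;
}
```

Testing the whole register avoids the shift and mask in the probe path, which is the simplification the commit message describes.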
When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.
If connectivity is lost for a sufficient amount of time that the
controller is then deleted, the delete path starts tearing down queues,
and eventually calling nvme_ns_remove(). It appears that pending
commands may cause blk_cleanup_queue() to never complete and the
teardown stalls.
Correct by starting the ns queues after transitioning to a DELETING
state, allowing pending commands to be flushed with io failures. Thus
the delete path is clear when reached.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.
If an admin command is received while connectivity is lost, the ioctl
queues the command on the admin_q and the command stalls (the thread
issuing the ioctl hangs/waits). If the connectivity is lost long
enough such that the controller is then deleted, the delete code
makes its calls to initiate the delete, which then expects the core
layer to call the transport when all references are removed and the
controller can be freed. Unfortunately, nothing in this path dequeued
the admin command, so a reference sits outstanding and things stop,
hanging the delete indefinitely.
Correct by unquiescing the admin queue in the delete association. This
means any admin command (which should only be from an ioctl) issued
after connectivity is lost will detect the controller is in a
reconnecting state and will (fast) fail the command. Thus, a pending
reference can no longer be created. Once connectivity is re-established,
a new ioctl/admin command would see proper device state and function again.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet_req_init looked up a namespace and took a reference on it (unless it
failed prior to that). If the request is uninitialized (in error cases) we
need to remove that reference in case it was taken, otherwise we leak
a namespace reference when calling nvme_req_uninit.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We use match_strdup() to get a copy of the option string for host ID string, but
we just pass it to uuid_parse() and don't store the string pointer, so we need to
kfree() the string after parsing it.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
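The leak fixed by the host ID commit above follows a common pattern: a temporary copy is made, parsed, and must then be released on every path. Below is a user-space sketch of that pattern. `sketch_dup()` stands in for match_strdup() and `sketch_parse_id()` for uuid_parse(); both are hypothetical helpers, and the "parser" only checks the length.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for match_strdup(): heap copy of a string. */
char *sketch_dup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *p = malloc(len);

	if (p)
		memcpy(p, s, len);
	return p;
}

/* Hypothetical stand-in for uuid_parse(): accepts any 36-character string. */
int sketch_parse_id(const char *s, char out[37])
{
	if (strlen(s) != 36)
		return -1;
	memcpy(out, s, 37);
	return 0;
}

/* Copy the option value, parse it, then free the copy on every path. */
int sketch_set_host_id(const char *option, char out[37])
{
	char *p = sketch_dup(option);
	int ret;

	if (!p)
		return -1;
	ret = sketch_parse_id(p, out);
	free(p);	/* the copy is never stored, so release it here */
	return ret;
}
```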
Fix comment typos in nvme_create_io_queues():
_aount_ to _amount_
_an_ to _can_
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Uses common code for determining if an error should be retried on
alternate path.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This removes nvme multipath's specific status decoding to see if failover
is needed, using the generic blk_status_t that was decoded earlier. This
abstraction from the raw NVMe status means all status decoding exists
in one place.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds more NVMe status code translations to blk_status_t values,
and captures all the current status codes NVMe multipath uses.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
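The three multipath commits above share one idea: translate the raw NVMe status once, then make the failover decision from the generic block status. A minimal sketch of that split follows. All enum values and function names here are hypothetical and chosen for illustration; they do not reproduce the kernel's actual status codes or its set of retryable errors.

```c
#include <stdbool.h>

/* Hypothetical raw device status codes. */
enum sketch_nvme_sc {
	SK_SC_SUCCESS,
	SK_SC_CAP_EXCEEDED,
	SK_SC_WRITE_FAULT,
	SK_SC_HOST_PATH_ERROR,
	SK_SC_UNKNOWN
};

/* Hypothetical generic block-layer status values. */
enum sketch_blk_sts {
	SK_STS_OK,
	SK_STS_NOSPC,
	SK_STS_MEDIUM,
	SK_STS_TRANSPORT,
	SK_STS_IOERR
};

/* Single place where the raw status is decoded. */
enum sketch_blk_sts sketch_translate(enum sketch_nvme_sc sc)
{
	switch (sc) {
	case SK_SC_SUCCESS:
		return SK_STS_OK;
	case SK_SC_CAP_EXCEEDED:
		return SK_STS_NOSPC;
	case SK_SC_WRITE_FAULT:
		return SK_STS_MEDIUM;
	case SK_SC_HOST_PATH_ERROR:
		return SK_STS_TRANSPORT;
	default:
		return SK_STS_IOERR;
	}
}

/* Failover decision made purely from the generic status. */
bool sketch_should_failover(enum sketch_blk_sts sts)
{
	switch (sts) {
	case SK_STS_OK:
	case SK_STS_NOSPC:
	case SK_STS_MEDIUM:
		return false;	/* another path would see the same result */
	default:
		return true;	/* possibly path related, try an alternate path */
	}
}
```

Because the multipath decision only looks at the generic value, adding a new raw status code means touching the translation function alone.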
There is a problem when another module (e.g. nvmet) takes a reference on
the nvme block device and the physical nvme drive is removed. In that
case nvme_free_ctrl() will not be called and the controller state will be
"deleting" or "dead" unless nvmet module releases the block device.
Later on, the same nvme drive probes back and nvme_init_subsystem() will
be called and fail due to duplicate subnqn (if the nvme device doesn't
support subsystem with multiple controllers). This will cause a probe
failure. This commit changes the check of multiple controllers support
at nvme_init_subsystem() by not counting all the controllers at "dead" or
"deleting" state (this is safe because controllers at this state will
never be active again).
Fixes: ab9e00cc72fa ("nvme: track subsystems")
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
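The subsystem check described in the commit above can be sketched as a count that skips controllers which can never become active again. The state names, the array representation and both helpers are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical controller states for this sketch. */
enum sketch_ctrl_state {
	SK_CTRL_LIVE,
	SK_CTRL_RESETTING,
	SK_CTRL_DELETING,
	SK_CTRL_DEAD
};

/* Count only controllers that may still become active. */
size_t sketch_count_active(const enum sketch_ctrl_state *states, size_t n)
{
	size_t i, active = 0;

	for (i = 0; i < n; i++)
		if (states[i] != SK_CTRL_DELETING && states[i] != SK_CTRL_DEAD)
			active++;
	return active;
}

/*
 * A new controller may join the subsystem if multiple controllers are
 * supported, or if no potentially active controller is left in it.
 */
bool sketch_can_add_ctrl(const enum sketch_ctrl_state *states, size_t n,
			 bool multi_ctrl)
{
	return multi_ctrl || sketch_count_active(states, n) == 0;
}
```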
The block device is backed by the transport so we must ensure that the
transport driver will not be removed until all references are released.
Otherwise, we might end up referencing freed memory.
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When the io queue setup or tagset allocation fails, ctrl.tagset is
NULL. The scan work is still queued and executed, and a panic follows
from the NULL pointer dereference of ctrl.tagset.
To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to indicate that
only the admin queue is live. When no io queues are set up or tagset
allocation fails, the ctrl enters this state and scan work is not started.
Async event work and the nvme dev ioctl remain available, which helps
with further investigation and recovery.
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
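A rough sketch of how an admin-only state gates the scan work while leaving admin access open, as the commit above describes. The struct, the state names and the helper functions are hypothetical and heavily simplified compared with the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states for this sketch. */
enum sketch_state { SK_STATE_LIVE, SK_STATE_ADMIN_ONLY };

struct sketch_ctrl {
	void *tagset;			/* NULL when no I/O queues exist */
	enum sketch_state state;
};

/* Choose the state after a reset based on whether I/O queues came up. */
void sketch_finish_reset(struct sketch_ctrl *c)
{
	assert(c != NULL);
	c->state = c->tagset ? SK_STATE_LIVE : SK_STATE_ADMIN_ONLY;
}

/* Namespace scanning needs the tagset, so only a live controller scans. */
bool sketch_may_scan(const struct sketch_ctrl *c)
{
	return c->state == SK_STATE_LIVE;
}

/* Admin ioctls and async events work in both states. */
bool sketch_may_admin(const struct sketch_ctrl *c)
{
	return c->state == SK_STATE_LIVE || c->state == SK_STATE_ADMIN_ONLY;
}
```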
When an NVMe controller reports RTD3 Entry Latency larger than the value
of shutdown_timeout module parameter, we update the shutdown_timeout
accordingly to honor RTD3 Entry Latency. Use an informational debug level
instead of a warning level for it.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
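The timeout adjustment mentioned in the commit above amounts to taking the larger of the module parameter and the reported latency. The sketch below assumes the latency is given in microseconds and the parameter in seconds; the function name is hypothetical and any upper cap the driver may apply is left out.

```c
#include <stdint.h>

/*
 * Return the shutdown timeout in seconds that honors both the module
 * parameter and the controller's reported RTD3 Entry Latency.
 * Assumption: rtd3e_us is in microseconds, param_secs in seconds.
 */
unsigned int sketch_shutdown_timeout(uint32_t rtd3e_us, unsigned int param_secs)
{
	unsigned int latency_secs = rtd3e_us / 1000000u;

	return latency_secs > param_secs ? latency_secs : param_secs;
}
```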
Make it symmetric to nvmet_alloc_ctrl().
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove the allocated id on error.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The local variable __size__ will be set a bit later in a for-loop.
Remove the explicit initialization at the beginning of this function.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
NVMe transport driver module unload may (and usually does) trigger
iteration over the active controllers and delete them all (sometimes
under a mutex). However, a controller can be created concurrently with
module unload which can lead to leakage of resources (most important char
device node leakage) in case the controller creation occurred after the
unload delete and drain sequence. To protect against this, we take a
module reference to guarantee that the nvme transport driver is not
unloaded while creating a controller.
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
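The protection described in the commit above relies on pinning the transport module for the duration of controller creation. In the kernel this is done with try_module_get() and module_put(); the sketch below models the same idea in plain, single-threaded C with a hypothetical `sketch_module` type and no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_module {
	int refs;		/* users currently pinning the module */
	bool unloading;		/* set once unload has started */
};

/* Take a reference unless the module is already going away. */
bool sketch_try_get(struct sketch_module *m)
{
	assert(m != NULL);
	if (m->unloading)
		return false;
	m->refs++;
	return true;
}

void sketch_put(struct sketch_module *m)
{
	assert(m != NULL && m->refs > 0);
	m->refs--;
}

/* Unload is refused while anybody holds a reference. */
bool sketch_unload(struct sketch_module *m)
{
	if (m->refs > 0)
		return false;
	m->unloading = true;
	return true;
}

/* Controller creation runs entirely under a module reference. */
int sketch_create_ctrl(struct sketch_module *m)
{
	if (!sketch_try_get(m))
		return -1;
	/* ... allocate queues, register the char device ... */
	sketch_put(m);
	return 0;
}
```

With this ordering, creation either fails early because unload has begun, or unload is held off until creation has finished.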
The current fc transport add_port routine validates that there is a
matching port to the target port config. It then takes a reference
on the targetport. The del_port removes the reference.
Unfortunately, if the LLDD undergoes a hw reset or driver unload and
wants to unreg the targetport, due to the reference, the targetport
effectively can't be removed. It requires the admin to remove the
port from the nvmet config first, which calls the del_port.
Note: it appears nvmetcli clear skips over the del_port call (I'm
not attempting to change that).
There's no real reason to take the reference. With FC, there is nothing
to enable or disable as the presence of the FC targetport implicitly
means it's enabled, and removal of the targetport means it's disabled.
Change add_port to simply validate and change remove_port to a noop.
No references are taken on the targetport.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The split between what the host accesses on its flows vs what the
target side accesses was flawed. Abort handling didn't properly
clear initiator vs target structure cross-reference and locks
weren't used for synchronization. Thus, there were issues of
freeing structures too soon and access after free.
A couple of these existed pre the IN_ISR mods, but when the
target upcalls were converted to work items, thus adding delays
between the 2 sides of accesses, the problems became pronounced.
Resolve by:
- tracking io state mainly in the tgt-side io structure.
- make the tgt-side io structure released by reference not by
code flow.
- when changing initiator structures, use locks for
synchronization
- aborts are clearly tracked for which side saw the abort, and
after seeing the abort, cross-references are cleared under lock.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The existing fcloop driver expects the target side upcalls to
the transport to context switch, thus the calls into the nvmet layer
are not done in the calling context of the host/initiator down calls.
The xxx_IN_ISR feature flags are used to select this logic.
The xxx_IN_ISR feature flags should go away in the nvmet_fc transport
as no other lldd utilizes them. Both Broadcom and Cavium lldds have their
own non-ISR deferred handlers thus the nvmet calls can be made directly.
This patch converts the paths that make the target upcalls (command
receive, abort receive) such that they schedule a work item rather
than expecting the transport to schedule the work item.
The patch also cleans up the following:
- The completion path from target to host scheduled a host work
element called "work". Rename it "tio_done_work" for code clarity.
- The abort io path called an iniwork item to call the host side
io done. This is no longer needed as the abort routine can make
the same call.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The current fcloop driver gets its lport structure from the private
area co-allocated with the fc_localport. All is fine except the
teardown path, which wants to wait on the completion, which is marked
complete by the delete_localport callback performed after
unregister_localport. The issue is, the nvme_fc transport frees the
localport structure immediately after delete_localport is called,
meaning the original routine is trying to wait on a completion that
was just freed.
Change such that a lport struct is allocated coincident with the
addition and registration of a localport. The private area of the
localport now contains just a backpointer to the real lport struct.
Now, the completion can be waited for, and after completing, the
new structure can be kfree'd.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A test case revealed a race condition of an i/o completing on a thread
parallel to the delete_association generating the aborts for the
outstanding ios on the controller. The i/o completion was freeing the
target fcloop context, thus the abort task referenced the just-freed
memory.
Correct by clearing the target/initiator cross pointers in the io
completion and abort tasks before calling the callbacks. On aborts
that detect already finished io's, ensure the complete context is
called.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
It is a bit chatty to report on each queue, log it only for debug
purposes.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
It is a bit chatty to report on every deleted queue, so keep it for debug
purposes only.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We already do that when we are notified in device removal
which is triggered when unregistering as an ib client.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare for the 2.0 revision by adapting the geometry
structures to coexist with the 1.2 revision.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The lower page table is unused. All page tables reported by 1.2
devices are all reporting a sequential 1:1 page mapping. This is
also not used going forward with the 2.0 revision.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that rrpc has been removed, also remove the hybrid 1.2 support
from the core.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following condition causes a NULL pointer dereference in
nvme_free_host_mem() when the pci device is removed via nvme_remove(),
especially after a failed host memory allocation for the HMB:
"(host_mem_descs == NULL) && (nr_host_mem_descs != 0)"
This happens because __nr_host_mem_descs__ is not cleared to 0 when
__host_mem_descs__ is.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
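The HMB fix above comes down to keeping a pointer and its element count in step. A minimal user-space sketch, with a hypothetical `sketch_dev` struct and plain `int` descriptors standing in for the real ones:

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_dev {
	int *host_mem_descs;		/* descriptor array, may be NULL */
	unsigned int nr_host_mem_descs;	/* number of valid entries */
};

/*
 * Walk and release the descriptors. Returns how many entries were visited.
 * The loop trusts the count, so the count must be reset together with the
 * pointer; otherwise a second call would index through a NULL pointer.
 */
unsigned int sketch_free_host_mem(struct sketch_dev *dev)
{
	unsigned int i, visited = 0;

	assert(dev != NULL);
	for (i = 0; i < dev->nr_host_mem_descs; i++) {
		(void)dev->host_mem_descs[i];	/* per-entry teardown goes here */
		visited++;
	}
	free(dev->host_mem_descs);
	dev->host_mem_descs = NULL;
	dev->nr_host_mem_descs = 0;	/* clear the count along with the pointer */
	return visited;
}
```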
If the nvme_rdma_wait_for_cm timeout expires before we get
an established or rejected event (rdma_connect succeeded) from
rdma_cm, we end up leaking the ib transport resources for the
dedicated queue. This scenario can easily be reproduced using a traffic
test during port toggling.
Also, in order to protect against parallel ib queue destruction, which
may be invoked from different contexts, introduce a new flag that
stands for transport readiness. While we're here, also protect against
receiving rdma_cm events during ib queue destruction.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only.
If an I/O scheduler is used, it doesn't iterate over the hctx scheduler
tags and the static requests aren't updated. For example, when using the
NVMe over Fabrics RDMA host, this causes us not to reinit the scheduler
requests and thus not to re-register all the memory regions during the
tagset re-initialization in the reconnect flow.
This may lead to a memory registration error:
"MEMREG for CQE 0xffff88044c14dce8 failed with status memory management operation error (6)"
With this commit we don't need to reinit the requests, and thus fix this
failure.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If we got a remote invalidation on a bogus rkey, this is a protocol error.
Fail the connection in this case.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We must not complete a request before the host memory region is
invalidated. Luckily we have send with invalidate protocol support so
we usually don't need to execute it, but in case the target did not
invalidate a memory region for us, we must wait for the invalidation to
complete before unmapping host memory and completing the I/O.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In order to guarantee that the HCA will never get an access violation
(either from an invalidated rkey or from the iommu) when retrying a send
operation, we must complete a request only when both the send completion
and the nvme cqe have arrived. We need to set the send/recv completion flags
atomically because we might have more than a single context accessing the
request concurrently (one is the cq irq-poll context and the other is
user-polling used in IOCB_HIPRI).
Only then are we safe to invalidate the rkey (if needed), unmap the host
buffers, and complete the IO.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
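The "complete only after both events" rule from the commit above can be sketched with a C11 atomic counter. This is an illustration under assumptions: it uses a countdown rather than the flags the message mentions, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_req {
	atomic_int pending;	/* events still outstanding */
	bool completed;		/* set by whoever sees the last event */
};

/* Arm the request: one send completion and one response CQE are expected. */
void sketch_req_start(struct sketch_req *r)
{
	assert(r != NULL);
	atomic_init(&r->pending, 2);
	r->completed = false;
}

/*
 * Called once per event, possibly from different contexts. Only the caller
 * that observes the final event finishes the request, so teardown runs
 * exactly once and never before both events have been seen.
 */
bool sketch_req_event(struct sketch_req *r)
{
	if (atomic_fetch_sub(&r->pending, 1) == 1) {
		/* invalidate the rkey if needed, unmap buffers, then complete */
		r->completed = true;
		return true;
	}
	return false;
}
```

The order in which the two events arrive does not matter; whichever comes second performs the completion.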
The entire completions suppress mechanism is currently broken because the
HCA might retry a send operation (due to dropped ack) after the nvme
transaction has completed.
In order to handle this, we signal all send completions and introduce a
separate done handler for async events as they will be handled differently
(as they don't include in-capsule data by definition).
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
So far harmless, but it's confusing and a bug waiting to happen if the
shifts grow larger than 4.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
And increase the existing delay to cover this device as well.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Whenever a cmd is received a reference is taken while looking up the
queue. The reference is removed after the cmd is done as the iod is
returned for reuse. The fod may be reused for a deferred (received but
no job context) cmd. Existing code removes the reference only if the
fod is not reused for another command. Given the fod may be used for
one or more ios, although a reference was taken per io, it won't be
matched on the frees.
Remove the reference on every fod free. This pairs the references to
each io.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
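The reference pairing from the commit above can be sketched as follows: each received command takes one queue reference, and each fod free drops one, whether or not the fod is handed to a deferred command. The types and helpers are hypothetical and single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_queue {
	int refs;		/* one reference per command in flight */
};

struct sketch_fod {
	bool active;		/* true while the fod carries a command */
};

/* Every received command takes a queue reference, deferred or not. */
void sketch_cmd_received(struct sketch_queue *q)
{
	q->refs++;
}

/*
 * Release a fod after its command finished. The finished command's
 * reference is dropped unconditionally. If a deferred command is waiting,
 * the fod is handed to it and stays active under that command's own
 * reference, which is dropped on the next free.
 */
void sketch_fod_free(struct sketch_queue *q, struct sketch_fod *fod,
		     bool deferred_waiting)
{
	assert(q->refs > 0);
	q->refs--;
	fod->active = deferred_waiting;
}
```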
The ns->head is always valid, so we don't need to check for NULL.
Reported-by: Dan Carpenter <dan.caprenter@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This fixes using the NULL 'head' before getting the reference. It is
however possible the head will always be NULL, so this patch uses the
struct nvme_ns to get the ns_id field.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Trivial fix to spelling mistake in dev_warn_ratelimited message text
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
An hmb descriptor index goes out of bounds under the following conditions:
preferred = 128MiB
chunk_size = 4MiB
hmmaxd = 1
In this case the current code does not let rmmod, which frees the hmb
descriptors, complete successfully.
With max_entries = 1 only a single "descs" entry is allocated, yet the
for-loop keeps setting "descs[i]" without checking the index against
"max_entries".
Add a condition to the for-loop that checks the descriptor index.
Fixes: 044a9df1 ("nvme-pci: implement the HMB entry number and size limitations")
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
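The bound added by the HMB commit above can be shown with a small loop that stops at whichever limit is hit first: the preferred size or the number of allocated descriptor slots. The function name and the use of plain integers as descriptors are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill descriptors chunk by chunk until either the preferred size is
 * covered or all max_entries slots are used. Returns the number of
 * descriptors written, which never exceeds max_entries.
 */
size_t sketch_fill_descs(unsigned long long *descs, size_t max_entries,
			 unsigned long long preferred,
			 unsigned long long chunk_size)
{
	unsigned long long size;
	size_t i = 0;

	assert(descs != NULL && chunk_size > 0);
	for (size = 0; size < preferred && i < max_entries; size += chunk_size)
		descs[i++] = chunk_size;
	return i;
}
```

With the values from the message (128MiB preferred, 4MiB chunks, one entry) the loop writes a single descriptor instead of running past the array.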
The NVMe device in question drops off the PCIe bus after system suspend.
I've tried several approaches to work around this issue, but none of them
works:
- NVME_QUIRK_DELAY_BEFORE_CHK_RDY
- NVME_QUIRK_NO_DEEPEST_PS
- Disable APST before controller shutdown
- Delay between controller shutdown and system suspend
- Explicitly set power state to 0 before controller shutdown
Fortunately it's a desktop, so disabling APST won't hurt the battery.
Also, change the quirk function name to reflect that it's for vendor
combination quirks.
BugLink: https://bugs.launchpad.net/bugs/1705748
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In case the queue is not LIVE (fully functional and connected at the nvmf
level), we cannot allow any commands other than connect to pass through.
Add a new queue state flag NVME_LOOP_Q_LIVE which is set after nvmf connect
and cleared in queue teardown.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
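The queue gating described in the commit above can be sketched with a single flag bit that is set after connect and cleared on teardown. The flag value, the struct and the helper names are hypothetical; only the behavior (connect is the one command allowed on a non-live queue) comes from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_Q_LIVE 0x1UL	/* hypothetical flag bit */

struct sketch_loop_queue {
	unsigned long flags;
};

/* Set once the fabrics connect command has succeeded. */
void sketch_connect_done(struct sketch_loop_queue *q)
{
	assert(q != NULL);
	q->flags |= SKETCH_Q_LIVE;
}

/* Cleared when the queue is torn down. */
void sketch_teardown(struct sketch_loop_queue *q)
{
	assert(q != NULL);
	q->flags &= ~SKETCH_Q_LIVE;
}

/* Returns 0 if the command may pass, -1 if it must be rejected. */
int sketch_queue_ready(const struct sketch_loop_queue *q, bool is_connect)
{
	if (q->flags & SKETCH_Q_LIVE)
		return 0;
	return is_connect ? 0 : -1;
}
```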