[ Upstream commit b6a545ffa2c192b1e6da4a7924edac5ba9f4ea2b ]
ttag is used as an index to look up the cmd in nvmet_tcp_handle_h2c_data_pdu(),
so add a bounds check to avoid out-of-bounds access.
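A minimal sketch of the kind of check involved (the error-path helper and
exact message are assumed for illustration, not quoted from the patch):

    /* sketch: validate ttag before indexing queue->cmds */
    if (unlikely(data->ttag >= queue->nr_cmds)) {
            pr_err("queue %d: received out-of-bounds ttag %u (nr_cmds %u)\n",
                   queue->idx, data->ttag, queue->nr_cmds);
            nvmet_tcp_fatal_error(queue);
            return -EPROTO;
    }
    cmd = &queue->cmds[data->ttag];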
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a8eb6c1ba48bddea82e8d74cbe6e119f006be97d ]
The firmware revision can change after a reset, so copy the most
recent info each time instead of only the first time; otherwise the
sysfs firmware_rev entry may contain stale data.
Reported-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 61ce339f19fabbc3e51237148a7ef6f2270e44fa upstream.
If swiotlb is force enabled, dma_max_mapping_size ends up calling
swiotlb_max_mapping_size, which takes into account the min align mask for
the device. Set the min align mask for the nvme driver before calling
dma_max_mapping_size while calculating max hw sectors.
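A rough sketch of the ordering fix (constants as used elsewhere in the
pci driver; treat this as illustrative, not the exact hunk):

    /* sketch: set the min align mask before querying the mapping size */
    dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);
    dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ << 1,
                                     dma_max_mapping_size(dev->dev) >> 9);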
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 72e3b8883a36e80ebfa41015c7b6926ce31ace05 upstream.
When we revalidate paths as part of an ns size change (as of commit
e7d65803e2bb), it is possible that during the path revalidation the
only paths that are IO capable (i.e. optimized/non-optimized) are the
ones for which the ns resize was not yet communicated to the host. This
causes inflight requests to be requeued (we have available paths but
none are IO capable). These requests on the requeue list are waiting for
someone to resubmit them at some point.
The IO capable paths will eventually notify the host of the ns resize
change, but there is nothing that will kick the requeue list to resubmit
the queued requests.
Fix this by always kicking the requeue list (see the sketch below); if
no IO capable path exists, these requests will simply be requeued again.
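For reference, kicking the requeue lists amounts to scheduling each
head's requeue work; roughly:

    void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
    {
            struct nvme_ns *ns;

            down_read(&ctrl->namespaces_rwsem);
            list_for_each_entry(ns, &ctrl->namespaces, list) {
                    if (ns->head->disk)
                            kblockd_schedule_work(&ns->head->requeue_work);
            }
            up_read(&ctrl->namespaces_rwsem);
    }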
A typical log that indicates that IOs are requeued:
--
nvme nvme1: creating 4 I/O queues.
nvme nvme1: new ctrl: "testnqn1"
nvme nvme2: creating 4 I/O queues.
nvme nvme2: mapped 4/0/0 default/read/poll queues.
nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009
nvme nvme1: rescanning namespaces.
nvme1n1: detected capacity change from 2097152 to 4194304
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
nvme nvme2: rescanning namespaces.
--
Reported-by: Yogev Cohen <yogev@lightbitslabs.com>
Fixes: e7d65803e2bb ("nvme-multipath: revalidate paths during rescan")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c292a337d0e45a292c301e3cd51c35aa0ae91e95 ]
The IOC_PR_CLEAR and IOC_PR_RELEASE ioctls are
non-functional on NVMe devices because the nvme_pr_clear()
and nvme_pr_release() functions set the IEKEY field incorrectly.
The IEKEY field should be set only when the key is zero (i.e.,
not specified). The current code does it backwards.
Furthermore, the NVMe spec describes the persistent
reservation "clear" function as an option on the reservation
release command. The current implementation of nvme_pr_clear()
erroneously uses the reservation register command.
Fix these errors. Note that NVMe version 1.3 and later specify
that setting the IEKEY field will return an error of Invalid
Field in Command. The fix will set IEKEY when the key is zero,
which is appropriate as these ioctls consider a zero key to
be "unspecified", and the intention of the spec change is
to require a valid key.
Tested on a version 1.4 PCI NVMe device in an Azure VM.
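A sketch of the corrected helpers, assuming the driver's existing
nvme_pr_command/nvme_pr_type helpers (IEKEY is bit 3 of cdw10):

    static int nvme_pr_release(struct block_device *bdev, u64 key,
                               enum pr_type type)
    {
            u32 cdw10 = nvme_pr_type(type) << 8 | (key ? 0 : 1 << 3);

            return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
    }

    static int nvme_pr_clear(struct block_device *bdev, u64 key)
    {
            u32 cdw10 = 1 | (key ? 0 : 1 << 3);  /* release action = clear */

            return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
    }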
Fixes: 1673f1f08c88 ("nvme: move block_device_operations and ns/ctrl freeing to common code")
Fixes: 1d277a637a71 ("NVMe: Add persistent reservation ops")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 478814a5584197fa1fb18377653626e3416e7cd6 ]
TCP_FIN_WAIT2 and TCP_LAST_ACK were not handled; the connection is
closing anyway, so we can ignore them and avoid printing the
"unhandled state" warning message.
[ 1298.852386] nvmet_tcp: queue 2 unhandled state 5
[ 1298.879112] nvmet_tcp: queue 7 unhandled state 5
[ 1298.884253] nvmet_tcp: queue 8 unhandled state 5
[ 1298.889475] nvmet_tcp: queue 9 unhandled state 5
v2: Do not call nvmet_tcp_schedule_release_queue(), just ignore
the fin_wait2 and last_ack states.
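A sketch of the resulting state handling in the nvmet-tcp state-change
callback (simplified; the release path is condensed):

    switch (sk->sk_state) {
    case TCP_FIN_WAIT1:
    case TCP_CLOSE_WAIT:
    case TCP_CLOSE:
            nvmet_tcp_schedule_release_queue(queue);
            break;
    case TCP_FIN_WAIT2:
    case TCP_LAST_ACK:
            break;  /* connection is already closing; nothing to do */
    default:
            pr_warn("queue %d unhandled state %d\n",
                    queue->idx, sk->sk_state);
    }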
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b7e97872a65e1d57b4451769610554c131f37a0a ]
Maximum Active Resources (MAR) and Maximum Open Resources (MOR) are 0's
based values, where a value of 0xffffffff indicates that there is no limit.
Decrement the values returned by bdev_max_open_zones and
bdev_max_active_zones, as the block layer helpers are not 0's based.
A 0 returned by the block layer helpers indicates no limit, thus convert
it to 0xffffffff (U32_MAX).
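A sketch of the conversion when filling the zoned identify data
(variable names approximate):

    /* sketch: MAR/MOR are 0's based, block layer 0 means "no limit" */
    mor = bdev_max_open_zones(req->ns->bdev);
    id_zns->mor = mor ? cpu_to_le32(mor - 1) : cpu_to_le32(U32_MAX);

    mar = bdev_max_active_zones(req->ns->bdev);
    id_zns->mar = mar ? cpu_to_le32(mar - 1) : cpu_to_le32(U32_MAX);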
Fixes: aaf2e048af27 ("nvmet: add ZBD over ZNS backend support")
Suggested-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3770a42bb8ceb856877699257a43c0585a5d2996 ]
When we queue requests, we strive to batch as much as possible and also
signal the network stack that more data is about to be sent over a socket
with MSG_SENDPAGE_NOTLAST. This flag looks at the pending requests queued
as well as queue->more_requests that is derived from the block layer
last-in-batch indication.
We set more_requests=true when we flush the request directly from the
.queue_rq submission context (in nvme_tcp_send_all); however, this
wrongly assumes that no other requests may be queued during the
execution of nvme_tcp_send_all.
Due to this, a race condition may happen where:
1. request X is queued as !last-in-batch
2. request X submission context calls nvme_tcp_send_all directly
3. nvme_tcp_send_all is preempted and schedules to a different cpu
4. request Y is queued as last-in-batch
5. nvme_tcp_send_all context sends request X+Y, however signals for
both MSG_SENDPAGE_NOTLAST because queue->more_requests=true.
==> neither request is pushed down to the wire as the network
stack is waiting for more data, and both requests time out.
To fix this, eliminate queue->more_requests and rely only on the
queue's req_list and send_list being non-empty.
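The batching decision then reduces to checking the queues themselves,
along the lines of:

    static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
    {
            return !list_empty(&queue->send_list) ||
                   !llist_empty(&queue->req_list);
    }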
Fixes: 122e5b9f3d37 ("nvme-tcp: optimize network stack with setting msg flags according to batch size")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 160f3549a907a50e51a8518678ba2dcf2541abea ]
We should also bail from the io_work loop when we set rd_enabled to true,
so we don't attempt to read data from the socket when the TCP stream is
already out-of-sync or corrupted.
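A sketch of the io_work receive loop with the additional bail-out
(surrounding variables assumed from context):

    result = nvme_tcp_try_recv(queue);
    if (result > 0)
            pending = true;
    else if (unlikely(result < 0))
            return;

    /* sketch: also stop once the stream is marked unreadable */
    if (!pending || !queue->rd_enabled)
            return;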
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2aba0d19f4d8c8929b4b3b94a9cfde2aa20e6ee2 upstream
Add a helper to check the max supported sectors for zone append based on
the block_device instead of having to poke into the block layer internal
request_queue.
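The helper boils down to a thin wrapper, roughly:

    static inline unsigned int
    bdev_max_zone_append_sectors(struct block_device *bdev)
    {
            return queue_max_zone_append_sectors(bdev_get_queue(bdev));
    }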
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 533d2e8b4d5e4c89772a0adce913525fb86cbbee ]
We probably need nvmet_tcp_wq to have WQ_MEM_RECLAIM, as we are
sending/receiving on the socket from work items on this workqueue.
This also eliminates lockdep complaints:
--
[ 6174.010200] workqueue: WQ_MEM_RECLAIM
nvmet-wq:nvmet_tcp_release_queue_work [nvmet_tcp] is flushing
!WQ_MEM_RECLAIM nvmet_tcp_wq:nvmet_tcp_io_work [nvmet_tcp]
[ 6174.010216] WARNING: CPU: 20 PID: 14456 at kernel/workqueue.c:2628
check_flush_dependency+0x110/0x14c
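The allocation change is roughly a one-liner (assuming the existing
WQ_HIGHPRI flag is kept):

    nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq",
                                   WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
    if (!nvmet_tcp_wq)
            return -ENOMEM;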
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e06b425bc835ead08b9fd935bf5e47eef473e7a0 ]
nvme_revalidate_zones can also return -ENODEV if e.g. zone sizes aren't
constant or aren't a power of two. In that case we should jump to marking
the gendisk hidden and only support passthrough.
Fixes: 602e57c9799c ("nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info")
Reported-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 363f6368603743072e5f318c668c632bccb097a3 ]
When a fabrics controller claims to support an invalid metadata
configuration, we already warn and disable metadata support. No need to
also return an error during revalidation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d39ad2a45c0e38def3e0c95f5b90d9af4274c939 ]
The only fabrics target that supports metadata handling through the
separate integrity buffer is RDMA, and it is currently usable only if the
metadata size is 8B per block and formatted for protection information. If
an RDMA target were to export a namespace with a different format (e.g.
4k+64B), the driver would not be able to submit valid read/write commands
for that namespace.
Suppress setting the metadata feature in the namespace so that the
gendisk capacity will be set to 0. This will prevent read/write access
through the block stack, but will continue to allow ioctl passthrough
commands.
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a25d4261582cf00dad884c194d21084836663d3d ]
Commit 89b3d6e60550 ("nvme: simplify the compat ioctl handling") removed
the initialization of compat_ioctl from the nvme block_device_operations
structures.
Presumably the expectation was that 32-bit ioctls would be directed
through the regular handler but this is not the case: failing to assign
.compat_ioctl actually means that the compat case is disabled entirely,
and any attempt to submit nvme ioctls from 32-bit userspace fails
outright with -ENOTTY.
For example:
% smartctl -x /dev/nvme0n1
[...]
Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Inappropriate ioctl for device
The blkdev_compat_ptr_ioctl helper can be used to direct compat calls
through the main ioctl handler and makes things work again.
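An abridged sketch of the wiring (the fops table has more members in
the real driver):

    static const struct block_device_operations nvme_bdev_ops = {
            .owner        = THIS_MODULE,
            .ioctl        = nvme_ioctl,
            .compat_ioctl = blkdev_compat_ptr_ioctl, /* restores 32-bit ioctls */
            .open         = nvme_open,
            .release      = nvme_release,
    };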
Fixes: 89b3d6e60550 ("nvme: simplify the compat ioctl handling")
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 679c54f2de672b7d79d02f8c4ad483ff6dd8ce2e ]
Use command_id instead of req->tag in trace_nvme_complete_rq():
since commit e7006de6c238 ("nvme: code command_id with a genctr
for use authentication after release"), cmd->common.command_id is set to
((genctr & 0xf) << 12 | req->tag), no longer req->tag, which makes the
cid in trace_nvme_complete_rq and trace_nvme_setup_cmd differ.
Fixes: e7006de6c238 ("nvme: code command_id with a genctr for use authentication after release")
Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6961b5e02876b3b47f030a1f1ee8fd3e631ac270 ]
The issue exists when multipath is enabled and the namespace is
shared, but all the other controller checks in nvme_is_unique_nsid()
are false. The reason is that nvme_is_unique_nsid()
returns false when it is called from nvme_mpath_alloc_disk() due to an
uninitialized value of head->shared. Fix this by setting
head->shared before nvme_mpath_alloc_disk() is called.
Fixes: 5974ea7ce0f9 ("nvme: allow duplicate NSIDs for private namespaces")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e2d77d2e11c4f1e70a1a24cc8fe63ff3dc9b53ef ]
Lift the check for duplicate identifiers into nvme_init_ns_head, which
avoids pointless error unwinding in case they don't match, and also
matches where we check identifier validity for the multipath case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f7f70f4aa09dc43d7455c060143e86a017c30548 ]
We encountered a problem that the disconnect command hangs.
After analyzing the log and stack, we found that the triggering
process is as follows:
CPU0                               CPU1
                                   nvme_rdma_error_recovery_work
                                     nvme_rdma_teardown_io_queues
nvme_do_delete_ctrl                    nvme_stop_queues
  nvme_remove_namespaces
  --clear ctrl->namespaces
                                       nvme_start_queues
                                       --no ns in ctrl->namespaces
  nvme_ns_remove                       return (because ctrl is deleting)
    blk_freeze_queue
      blk_mq_freeze_queue_wait
      --wait for ns to unquiesce to clean inflight IO, hang forever
This problem was not found in older kernels because they flush the
error recovery work in nvme_stop_ctrl before nvme_remove_namespaces. The
callout does not seem to have been removed for functional reasons, so the
patch can be reverted to solve the problem.
Revert commit 794a4cb3d2f7 ("nvme: remove the .stop_ctrl callout")
Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 41d07df7de841bfbc32725ce21d933ad358f2844 ]
Queue stoppage and inflight request cancellation are fully fenced from
io_work, and thus from failing a request in that context. Hence we don't
need to try to guess from the socket return code whether this failure is
because the queue is about to be torn down or not.
We are perfectly safe to just fail the request; it will not be cancelled
later on.
This solves possible very long shutdown delays when the user issues a
'nvme disconnect-all'.
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit e1c70d79346356bb1ede3f79436df80917845ab9 upstream.
ADATA IM2P33F8ABR1 reports bogus eui64 values that appear to be the same
across all drives. Quirk them out so they are not marked as "non globally
unique" duplicates.
Co-developed-by: Felipe de Jesus Araujo da Conceição <felipe.conceicao@petrosoftdesign.com>
Signed-off-by: Felipe de Jesus Araujo da Conceição <felipe.conceicao@petrosoftdesign.com>
Signed-off-by: Lamarque V. Souza <lamarque.souza@petrosoftdesign.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ Upstream commit e6487833182a8a0187f0292aca542fc163ccd03e ]
This device shares the PCI ID with the Samsung 970 Evo Plus, which
does not need or want the quirks. Move the quirk entry to the
core table based on the model number instead.
Fixes: bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5a6254d55e2a9f7919ead8580d7aa0c7a382b26a ]
This particular Kioxia device times out and aborts I/O during any load,
but it's more easily observable with discards (fstrim).
The device also gets into a state in which it is not possible to use
"nvme set-feature" to disable APST.
Booting with nvme_core.default_ps_max_latency=0 solves the issue.
We had a dozen or so of these devices behaving this same way in
customer environments.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1fc766b5c08417248e0008bca14c3572ac0f1c26 ]
This provides more context to users.
Old message:
[ 00.000000] No UUID available providing old NGUID
New message:
[ 00.000000] block nvme0n1: No UUID available providing old NGUID
Fixes: d934f9848a77 ("nvme: provide UUID value to userspace")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 52fde2c07da606f3f120af4f734eadcfb52b04be ]
The nvme specification only requires qword alignment for segment
descriptors, and the driver already guarantees that. The spec has always
allowed user data to be dword aligned, which is what the queue's
attribute is for, so relax the alignment requirement to that value.
While we could allow byte alignment for some controllers when using
SGLs, we still need to support PRP, and that only allows dword.
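The relaxation itself is a one-line limit change, roughly (a mask of 3
gives dword, i.e. 4-byte, alignment):

    blk_queue_dma_alignment(q, 3);  /* dword alignment for user data */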
Fixes: 3b2a1ebceba3 ("nvme: set dma alignment to qword")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 78288665b5d0154978fed431985310cb4f166836 ]
In the current implementation we set the non-mdts limits by calling
nvme_init_non_mdts_limits() from nvme_init_ctrl_finish().
This also tries to set the limits for the discovery controller, which
has no I/O queues, resulting in the warning message reported by
nvme_log_error() when running blktests nvme/002:
[ 2005.155946] run blktests nvme/002 at 2022-04-09 16:57:47
[ 2005.192223] loop: module loaded
[ 2005.196429] nvmet: adding nsid 1 to subsystem blktests-subsystem-0
[ 2005.200334] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
<------------------------------SNIP---------------------------------->
[ 2008.958108] nvmet: adding nsid 1 to subsystem blktests-subsystem-997
[ 2008.962082] nvmet: adding nsid 1 to subsystem blktests-subsystem-998
[ 2008.966102] nvmet: adding nsid 1 to subsystem blktests-subsystem-999
[ 2008.973132] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN testhostnqn.
*[ 2008.973196] nvme1: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE DNR*
[ 2008.974595] nvme nvme1: new ctrl: "nqn.2014-08.org.nvmexpress.discovery"
[ 2009.103248] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
Move the call of nvme_init_non_mdts_limits() to nvme_scan_work() after
we verify that I/O queues are created since that is a converging point
for each transport where these limits are actually used.
1. FC :
nvme_fc_create_association()
...
nvme_fc_create_io_queues(ctrl);
...
nvme_start_ctrl()
nvme_scan_queue()
nvme_scan_work()
2. PCIe:-
nvme_reset_work()
...
nvme_setup_io_queues()
nvme_create_io_queues()
nvme_alloc_queue()
...
nvme_start_ctrl()
nvme_scan_queue()
nvme_scan_work()
3. RDMA :-
nvme_rdma_setup_ctrl
...
nvme_rdma_configure_io_queues
...
nvme_start_ctrl()
nvme_scan_queue()
nvme_scan_work()
4. TCP :-
nvme_tcp_setup_ctrl
...
nvme_tcp_configure_io_queues
...
nvme_start_ctrl()
nvme_scan_queue()
nvme_scan_work()
* nvme_scan_work()
...
nvme_validate_or_alloc_ns()
nvme_alloc_ns()
nvme_update_ns_info()
nvme_update_disk_info()
nvme_config_discard() <---
blk_queue_max_write_zeroes_sectors() <---
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit da42761181627e9bdc37d18368b827948a583929 ]
In nvme_alloc_admin_tags, the admin_q can be set to an error pointer
(typically -ENOMEM) if the blk_mq_init_queue call fails to set up the
queue, which is checked immediately after the call. However, when we
return the error up the stack to nvme_reset_work, the error takes us to
nvme_remove_dead_ctrl()
 nvme_dev_disable()
  nvme_suspend_queue(&dev->queues[0]).
Here, we only check that the admin_q is non-NULL, rather than neither an
error pointer nor NULL, and begin quiescing a queue that never existed,
leading to a bad / NULL pointer dereference.
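A sketch of the fix: reset admin_q to NULL on the error path so that
later NULL checks behave (error code as described above):

    dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);
    if (IS_ERR(dev->ctrl.admin_q)) {
            blk_mq_free_tag_set(&dev->admin_tagset);
            dev->ctrl.admin_q = NULL;  /* don't leave an error pointer behind */
            return -ENOMEM;
    }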
Signed-off-by: Kyle Smith <kyles@hpe.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a4a6f3c8f61c3cfbda4998ad94596059ad7e4332 ]
nvme_mpath_init_identify(), invoked from nvme_init_identify(), fetches a
fresh ANA log from the ctrl. This is essential to have up-to-date path
states both for existing namespaces and for those scan_work may
discover once the ctrl is up.
This happens in the following cases:
1) A new ctrl is being connected.
2) An existing ctrl is successfully reconnected.
3) An existing ctrl is being reset.
While in (1) ctrl->namespaces is empty, in (2) and (3) we may have
namespaces, and nvme_read_ana_log() may call nvme_update_ns_ana_state().
This results in a hang when the ANA state of an existing namespace changes
and makes the disk live: nvme_mpath_set_live() issues IO to the namespace
through the ctrl, which does NOT have IO queues yet.
See the sample hang below.
Solution:
- nvme_update_ns_ana_state() calls set_live only if the ctrl is live;
- the nvme_read_ana_log() call from nvme_mpath_init_identify()
  therefore only fetches and parses the ANA log;
  any errors in this process fail the ctrl setup as appropriate;
- a separate function, nvme_mpath_update() (sketched below),
  is called in nvme_start_ctrl();
  it parses the ANA log without fetching it.
  At this point the ctrl is live,
  therefore disks can be set live normally.
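A sketch of the parse-only helper referenced in the solution list above
(locking follows the existing ana_lock convention):

    void nvme_mpath_update(struct nvme_ctrl *ctrl)
    {
            u32 nr_change_groups = 0;

            if (!ctrl->ana_log_buf)
                    return;

            mutex_lock(&ctrl->ana_lock);
            nvme_parse_ana_log(ctrl, &nr_change_groups, nvme_update_ana_state);
            mutex_unlock(&ctrl->ana_lock);
    }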
Sample failure:
nvme nvme0: starting error recovery
nvme nvme0: Reconnecting in 10 seconds...
block nvme0n6: no usable path - requeuing I/O
INFO: task kworker/u8:3:312 blocked for more than 122 seconds.
Tainted: G E 5.14.5-1.el7.elrepo.x86_64 #1
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
__schedule+0x2a2/0x7e0
schedule+0x4e/0xb0
io_schedule+0x16/0x40
wait_on_page_bit_common+0x15c/0x3e0
do_read_cache_page+0x1e0/0x410
read_cache_page+0x12/0x20
read_part_sector+0x46/0x100
read_lba+0x121/0x240
efi_partition+0x1d2/0x6a0
bdev_disk_changed.part.0+0x1df/0x430
bdev_disk_changed+0x18/0x20
blkdev_get_whole+0x77/0xe0
blkdev_get_by_dev+0xd2/0x3a0
__device_add_disk+0x1ed/0x310
device_add_disk+0x13/0x20
nvme_mpath_set_live+0x138/0x1b0 [nvme_core]
nvme_update_ns_ana_state+0x2b/0x30 [nvme_core]
nvme_update_ana_state+0xca/0xe0 [nvme_core]
nvme_parse_ana_log+0xac/0x170 [nvme_core]
nvme_read_ana_log+0x7d/0xe0 [nvme_core]
nvme_mpath_init_identify+0x105/0x150 [nvme_core]
nvme_init_identify+0x2df/0x4d0 [nvme_core]
nvme_init_ctrl_finish+0x8d/0x3b0 [nvme_core]
nvme_tcp_setup_ctrl+0x337/0x390 [nvme_tcp]
nvme_tcp_reconnect_ctrl_work+0x24/0x40 [nvme_tcp]
process_one_work+0x1bd/0x360
worker_thread+0x50/0x3d0
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8832cf922151e9dfa2821736beb0ae2dd3968b6e ]
Any attempt to flush the kernel-global workqueues risks deadlock,
so we should simply stop using them. Instead, introduce nvmet_wq,
a generic nvmet workqueue for work elements that
don't explicitly require a dedicated workqueue (by the mere fact
that they are using the system_wq).
Changes were done using the following replaces:
- s/schedule_work(/queue_work(nvmet_wq, /g
- s/schedule_delayed_work(/queue_delayed_work(nvmet_wq, /g
- s/flush_scheduled_work()/flush_workqueue(nvmet_wq)/g
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bc360b0b1611566e1bd47384daf49af6a1c51837 ]
Add quirks to not fail the initialization and to have quick resume
latency after cold/warm reboot.
Signed-off-by: Monish Kumar R <monish.kumar.r@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 66dd346b84d79fde20832ed691a54f4881eac20d ]
Qemu unconditionally reports a UUID, which depending on the qemu version
is either all-null (which is incorrect but harmless) or contains a single
bit set for all controllers. In addition it can also optionally report
an eui64, which needs to be manually set. Disable namespace identifiers
for Qemu controllers entirely, even if in some cases they could be set
correctly through manual intervention.
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 00ff400e6deee00f7b15e200205b2708b63b8cf6 ]
Add a quirk to disable using and exporting namespace identifiers for
controllers where they are broken beyond repair.
The most directly visible problem with non-unique namespace identifiers
is that they break the /dev/disk/by-id/ links, with the link for a
supposedly unique identifier now pointing to one of multiple possible
namespaces that share the same ID, and a somewhat random selection of
which one actually shows up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 726be2c72efc0a64c206e854b8996ad3ab9c7507 upstream.
commit 2f4c9ba23b88 ("nvme: export zoned namespaces without Zone Append
support read-only") marks zoned namespaces without append support
read-only. It does so by setting NVME_NS_FORCE_RO in ns->flags in
nvme_update_zone_info and checking for that flag later in
nvme_update_disk_info to mark the disk as read-only.
But commit 73d90386b559 ("nvme: cleanup zone information initialization")
rearranged nvme_update_disk_info to be called before
nvme_update_zone_info, so the disk is no longer marked read-only.
The call order cannot simply be reverted, because nvme_update_zone_info
sets certain queue parameters, such as zone_write_granularity, that depend
on the prior call to nvme_update_disk_info.
Remove the call to set_disk_ro in nvme_update_disk_info and call
set_disk_ro after nvme_update_zone_info and nvme_update_disk_info to set
the permission for ZNS drives correctly. The same applies to the
multipath disk path.
Fixes: 73d90386b559 ("nvme: cleanup zone information initialization")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5974ea7ce0f9a5987fc8cf5e08ad6e3e70bb542e upstream.
An NVMe subsystem with multiple controllers can have private namespaces
that use the same NSID under some conditions:
"If Namespace Management, ANA Reporting, or NVM Sets are supported, the
NSIDs shall be unique within the NVM subsystem. If the Namespace
Management, ANA Reporting, and NVM Sets are not supported, then NSIDs:
a) for shared namespace shall be unique; and
b) for private namespace are not required to be unique."
Reference: Section 6.1.6 NSID and Namespace Usage; NVM Express 1.4c spec.
Make sure this specific setup is supported in Linux.
Fixes: 9ad1927a3bc2 ("nvme: always search for namespace head")
Signed-off-by: Sungup Moon <sungup.moon@samsung.com>
[hch: refactored and fixed the controller vs subsystem based naming
conflict]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 841aee4d75f18fdfb53935080b03de0c65e9b92c ]
Put NVMe/TCP sockets in their own class to avoid some lockdep warnings.
Sockets created by nvme-tcp are not exposed to user-space, and will not
trigger certain code paths that the general socket API exposes.
Lockdep complains about a circular dependency between the socket and
filesystem locks, because setsockopt can trigger a page fault with a
socket lock held, but nvme-tcp sends requests on the socket while file
system locks are held.
======================================================
WARNING: possible circular locking dependency detected
5.15.0-rc3 #1 Not tainted
------------------------------------------------------
fio/1496 is trying to acquire lock:
(sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendpage+0x23/0x80
but task is already holding lock:
(&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
which lock already depends on the new lock.
other info that might help us debug this:
chain exists of:
sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&xfs_dir_ilock_class/5);
lock(sb_internal);
lock(&xfs_dir_ilock_class/5);
lock(sk_lock-AF_INET);
*** DEADLOCK ***
6 locks held by fio/1496:
#0: (sb_writers#13){.+.+}-{0:0}, at: path_openat+0x9fc/0xa20
#1: (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: path_openat+0x296/0xa20
#2: (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc_icreate+0x41/0xd0 [xfs]
#3: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
#4: (hctx->srcu){....}-{0:0}, at: hctx_lock+0x51/0xd0
#5: (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]
This annotation lets lockdep analyze nvme-tcp controlled sockets
independently of what the user-space sockets API does.
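The annotation is done by reclassifying the socket locks right after
socket creation; roughly:

    static struct lock_class_key nvme_tcp_sk_key[2];
    static struct lock_class_key nvme_tcp_slock_key[2];

    static void nvme_tcp_reclassify_socket(struct socket *sock)
    {
            struct sock *sk = sock->sk;

            if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
                    return;

            switch (sk->sk_family) {
            case AF_INET:
                    sock_lock_init_class_and_name(sk, "slock-AF_INET-NVMe",
                                                  &nvme_tcp_slock_key[0],
                                                  "sk_lock-AF_INET-NVMe",
                                                  &nvme_tcp_sk_key[0]);
                    break;
            case AF_INET6:
                    sock_lock_init_class_and_name(sk, "slock-AF_INET6-NVMe",
                                                  &nvme_tcp_slock_key[1],
                                                  "sk_lock-AF_INET6-NVMe",
                                                  &nvme_tcp_sk_key[1]);
                    break;
            default:
                    WARN_ON_ONCE(1);
            }
    }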
Link: https://lore.kernel.org/linux-nvme/CAHj4cs9MDYLJ+q+2_GXUK9HxFizv2pxUryUR0toX974M040z7g@mail.gmail.com/
Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e2724cb9f0c406b8fb66efd3aa9e8b3edfd8d5c8 ]
nvme_subsys_check_duplicate_ids needs to return an error if any of
the identifiers matches, not just if all of them match. But it does not
need to, and should not, look at the CSI value for this sanity check.
Rewrite the logic to be separate from nvme_ns_ids_equal and optimize it
by reducing duplicate checks for non-present identifiers.
Fixes: ed754e5deeb1 ("nvme: track shared namespaces")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fd8099e7918cd2df39ef306dd1d1af7178a15b81 ]
Pass the actual nvme_ns_ids used for the comparison instead of the
ns_head that isn't needed and use a more descriptive function name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 602e57c9799c19f27e440639deed3ec45cfe1651 upstream.
Commit e7d65803e2bb ("nvme-multipath: revalidate paths during rescan")
introduced the NVME_NS_READY flag, which nvme_path_is_disabled() uses
to check if a path can be used or not. We also need to set this flag
for devices that fail the ZNS feature validation and which are available
through passthrough devices only, so that they can be used in multipathing
setups.
Fixes: e7d65803e2bb ("nvme-multipath: revalidate paths during rescan")
Reported-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7a5428dcb7902700b830e912feee4e845df7c019 upstream.
Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9eb803 that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.
Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.
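The new helper is essentially:

    void blk_mark_disk_dead(struct gendisk *disk)
    {
            set_bit(GD_DEAD, &disk->state);
            blk_queue_start_drain(disk->queue);
    }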
Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b6bb1722f34bbdbabed27acdceaf585d300c5fd2 ]
While nvme_rdma_submit_async_event_work checks the ctrl and queue
state before preparing the AER command and scheduling io_work, this
check alone is not reliable. To fully prevent the race, the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING, such that
there is no race between .submit_async_event and the error recovery
handler itself changing the ctrl state.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ff9fc7ebf5c06de1ef72a69f9b1ab40af8b07f9e ]
While nvme_tcp_submit_async_event_work checks the ctrl and queue
state before preparing the AER command and scheduling io_work, this
check alone is not reliable. To fully prevent the race, the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING, such that
there is no race between .submit_async_event and the error recovery
handler itself changing the ctrl state.
Tested-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0fa0f99fc84e41057cbdd2efbfe91c6b2f47dd9d ]
Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl
readiness for AER submission. This may lead to a use-after-free
condition that was observed with nvme-tcp.
The race condition may happen in the following scenario:
1. driver executes its reset_ctrl_work
2. -> nvme_stop_ctrl - flushes ctrl async_event_work
3. ctrl sends AEN which is received by the host, which in turn
schedules AEN handling
4. teardown admin queue (which releases the queue socket)
5. AEN processed, submits another AER, calling the driver to submit
6. driver attempts to send the cmd
==> use-after-free
In order to fix that, add a ctrl state check to validate that the ctrl
is actually able to accept the AER submission.
This addresses the above race in controller resets because the driver
during teardown should:
1. change ctrl state to RESETTING
2. flush async_event_work (as well as other async work elements)
So after 1,2, any other AER command will find the
ctrl state to be RESETTING and bail out without submitting the AER.
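A minimal sketch of the guard as it might look in a transport's
.submit_async_event handler (placement and field access are
illustrative):

    /* sketch: refuse to submit an AER unless the ctrl is live */
    if (ctrl->state != NVME_CTRL_LIVE)
            return;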
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 63573807b27e0faf8065a28b1bbe1cbfb23c0130 upstream.
AER is not backed by a real request, hence we should not incorrectly
assume that a failure to send an nvme command is for a normal request;
rather, check whether this is an AER and, if so, complete the AER
(similar to the normal completion path).
Cc: stable@vger.kernel.org
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 25e58af4be412d59e056da65cc1cefbd89185bd2 ]
The Intel P4500/P4600 SSDs do not report a subsystem NQN despite claiming
compliance to a standards version where reporting one is required.
Add the IGNORE_DEV_SUBNQN quirk to not fail the initialization of a
second such SSD in a system.
Signed-off-by: Zheng Wu <wu.zheng@intel.com>
Signed-off-by: Ye Jinhe <jinhe.ye@intel.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6a51abdeb259a56d95f13cc67e3a0838bcda0377 upstream.
Controller deletion/reset, immediately followed by or concurrent with
a reconnect, is hard failing the connect attempt resulting in a
complete loss of connectivity to the controller.
In the connect request, fabrics looks for an existing controller with
the same address components and aborts the connect if a controller
already exists and the duplicate connect option isn't set. The match
routine filters out controllers that are dead or dying, so they don't
interfere with the new connect request.
When NVME_CTRL_DELETING_NOIO was added, it missed updating the state
filters in the nvmf_ctlr_matches_baseopts() routine. Thus, when in this
new state, it's seen as a live controller and fails the connect request.
Correct by adding the DELETING_NOIO state to the match checks.
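The corrected match filter then looks roughly like:

    bool nvmf_ctlr_matches_baseopts(struct nvme_ctrl *ctrl,
                                    struct nvmf_ctrl_options *opts)
    {
            if (ctrl->state == NVME_CTRL_DELETING ||
                ctrl->state == NVME_CTRL_DELETING_NOIO ||  /* the fix */
                ctrl->state == NVME_CTRL_DEAD ||
                strcmp(opts->subsysnqn, ctrl->opts->subsysnqn) ||
                strcmp(opts->host->nqn, ctrl->opts->host->nqn) ||
                memcmp(&opts->host->id, &ctrl->opts->host->id, sizeof(uuid_t)))
                    return false;

            return true;
    }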
Fixes: ecca390e8056 ("nvme: fix deadlock in disconnect during scan_work and/or ana_work")
Cc: <stable@vger.kernel.org> # v5.7+
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c024b226a417c4eb9353ff500b1c823165d4d508 ]
Submit I/O requests with the IOCB_NOWAIT flag set only if
the underlying filesystem supports it.
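A sketch of the gating check in the file-ns I/O path (flag variable
name assumed for illustration):

    /* sketch: only use IOCB_NOWAIT when the underlying fs supports it */
    if (req->ns->file->f_mode & FMODE_NOWAIT)
            ki_flags |= IOCB_NOWAIT;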
Fixes: 50a909db36f2 ("nvmet: use IOCB_NOWAIT for file-ns buffered I/O")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>