linux/drivers/nvme/host
Sagi Grimberg c4abd87571 nvme-tcp: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
	inflight or scheduled (specifically also .stop_ctrl
	which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
	to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing (see
stack trace [1]).

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

[1]:
--
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? scan_shadow_nodes+0x40/0x40
 schedule+0x55/0xe0
 io_schedule+0x16/0x40
 do_read_cache_page+0x55d/0x850
 ? __page_cache_alloc+0x90/0x90
 read_cache_page+0x12/0x20
 read_part_sector+0x3f/0x110
 amiga_partition+0x3d/0x3e0
 ? osf_partition+0x33/0x220
 ? put_partition+0x90/0x90
 bdev_disk_changed+0x1fe/0x4d0
 blkdev_get_whole+0x7b/0x90
 blkdev_get_by_dev+0xda/0x2d0
 device_add_disk+0x356/0x3b0
 nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
 ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
 nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
 nvme_mpath_add_disk+0x120/0x160 [nvme_core]
 nvme_alloc_ns+0x594/0xa00 [nvme_core]
 nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
 ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
 nvme_scan_work+0x281/0x410 [nvme_core]
 process_one_work+0x1be/0x380
 worker_thread+0x37/0x3b0
 ? process_one_work+0x380/0x380
 kthread+0x12d/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
INFO: task nvme:6725 blocked for more than 491 seconds.
      Not tainted 5.15.65-f0.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme            state:D
 stack:    0 pid: 6725 ppid:  1761 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? sched_clock+0x9/0x10
 schedule+0x55/0xe0
 schedule_timeout+0x24b/0x2e0
 ? try_to_wake_up+0x358/0x510
 ? finish_task_switch+0x88/0x2c0
 wait_for_completion+0xa5/0x110
 __flush_work+0x144/0x210
 ? worker_attach_to_pool+0xc0/0xc0
 flush_work+0x10/0x20
 nvme_remove_namespaces+0x41/0xf0 [nvme_core]
 nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
 nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
 dev_attr_store+0x14/0x30
 sysfs_kf_write+0x38/0x50
 kernfs_fop_write_iter+0x146/0x1d0
 new_sync_write+0x114/0x1b0
 ? intel_pmu_handle_irq+0xe0/0x420
 vfs_write+0x18d/0x270
 ksys_write+0x61/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x37/0x90
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
--

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:35:46 +02:00
..
apple.c nvme-apple: stop casting function pointer signatures 2022-08-02 17:22:51 -06:00
auth.c nvme-auth: Diffie-Hellman key exchange support 2022-08-02 17:14:49 -06:00
constants.c nvme-pci: print the command name of aborted commands 2022-08-02 17:22:48 -06:00
core.c for-6.1/passthrough-2022-10-04 2022-10-07 09:35:50 -07:00
fabrics.c nvme: consider also host_iface when checking ip options 2022-09-19 17:55:24 +02:00
fabrics.h nvme: implement In-Band authentication 2022-08-02 17:14:49 -06:00
fault_inject.c block: remove the ->rq_disk field in struct request 2021-11-29 06:41:29 -07:00
fc.c nvme-fc: use the tagset alloc/free helpers 2022-09-27 14:44:17 +02:00
fc.h nvme-fc: Update header and host for common definitions for LS handling 2020-05-09 16:18:33 -06:00
hwmon.c nvme-hwmon: Return error code when registration fails 2021-03-05 13:41:03 +01:00
ioctl.c for-6.1/block-2022-10-03 2022-10-07 09:19:14 -07:00
Kconfig nvme-auth: Diffie-Hellman key exchange support 2022-08-02 17:14:49 -06:00
Makefile nvme: don't always build constants.o 2022-08-02 17:22:48 -06:00
multipath.c nvme: wire up async polling for io passthrough commands 2022-09-21 10:30:42 -06:00
nvme.h for-6.1/block-2022-10-03 2022-10-07 09:19:14 -07:00
pci.c for-6.1/passthrough-2022-10-04 2022-10-07 09:35:50 -07:00
rdma.c nvme-rdma: fix possible hang caused during ctrl deletion 2022-10-12 11:35:46 +02:00
tcp.c nvme-tcp: fix possible hang caused during ctrl deletion 2022-10-12 11:35:46 +02:00
trace.c nvme: implement In-Band authentication 2022-08-02 17:14:49 -06:00
trace.h nvme: use command_id instead of req->tag in trace_nvme_complete_rq() 2022-08-02 17:22:46 -06:00
zns.c block: pass a gendisk to blk_queue_max_open_zones and blk_queue_max_active_zones 2022-07-06 06:46:26 -06:00