linux/drivers/nvme
Sagi Grimberg a1ae8d4d9b nvme-rdma: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
        inflight or scheduled (specifically also .stop_ctrl
        which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
        to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing.

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

Fixes: b435ecea2a ("nvme: Add .stop_ctrl to nvme ctrl ops")
Fixes: f6c8e432cb ("nvme: flush namespace scanning work just before removing namespaces")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:35:46 +02:00
..
common nvmet-auth: fix a couple of spelling mistakes 2022-08-02 17:22:51 -06:00
host nvme-rdma: fix possible hang caused during ctrl deletion 2022-10-12 11:35:46 +02:00
target for-6.1/block-2022-10-03 2022-10-07 09:19:14 -07:00
Kconfig nvme: implement In-Band authentication 2022-08-02 17:14:49 -06:00
Makefile nvme: implement In-Band authentication 2022-08-02 17:14:49 -06:00