linux/block
Gabriel Krisman Bertazi 71f79fb317 blk-mq: Allow timeouts to run while queue is freezing
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive.  This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
..
partitions block: atari: Return early for unsupported sector size 2016-07-13 09:31:44 -07:00
badblocks.c block, badblocks: introduce devm_init_badblocks 2016-01-09 08:39:04 -08:00
bio-integrity.c Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block 2016-07-26 15:37:51 -07:00
bio.c block: add missing group association in bio-cloning functions 2016-08-04 14:19:16 -06:00
blk-cgroup.c block/blk-cgroup.c: Declare local symbols static 2016-06-14 09:09:33 -06:00
blk-core.c Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block 2016-07-26 15:37:51 -07:00
blk-exec.c block: Fix spelling in a source code comment 2016-07-20 21:28:22 -06:00
blk-flush.c block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH 2016-06-07 13:41:38 -06:00
blk-integrity.c block, libnvdimm, nvme: provide a built-in blk_integrity nop profile 2015-10-21 14:43:45 -06:00
blk-ioc.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
blk-lib.c Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block 2016-07-26 15:37:51 -07:00
blk-map.c block: simplify and export blk_rq_append_bio 2016-07-20 17:38:32 -06:00
blk-merge.c Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block 2016-07-26 15:37:51 -07:00
blk-mq-cpu.c blk-mq: add file comments and update copyright notices 2014-05-28 10:15:41 -06:00
blk-mq-cpumap.c blk-mq: Avoid memoryless numa node encoded in hctx numa_node 2015-12-03 09:56:27 -07:00
blk-mq-sysfs.c blk-mq: Use proper cpumask iterator 2016-03-20 09:34:02 -06:00
blk-mq-tag.c blk-mq: Introduce blk_mq_reinit_tagset 2016-07-08 08:38:49 -06:00
blk-mq-tag.h blk-mq: factor out a helper to iterate all tags for a request_queue 2015-10-01 10:10:57 +02:00
blk-mq.c blk-mq: Allow timeouts to run while queue is freezing 2016-08-04 14:19:16 -06:00
blk-mq.h blk-mq: dynamic h/w context count 2016-02-09 12:42:17 -07:00
blk-settings.c block: kill off q->flush_flags 2016-04-13 13:33:19 -06:00
blk-softirq.c block: fix regression with block enabled tagging 2014-04-09 21:54:06 -06:00
blk-sysfs.c block: expose QUEUE_FLAG_DAX in sysfs 2016-07-20 21:01:08 -06:00
blk-tag.c block: support different tag allocation policy 2015-01-23 14:15:46 -07:00
blk-throttle.c blkcg: kill unused field nr_undestroyed_grps 2016-08-04 14:19:16 -06:00
blk-timeout.c block: remove REQ_NO_TIMEOUT flag 2015-12-22 09:38:34 -07:00
blk.h block: simplify and export blk_rq_append_bio 2016-07-20 17:38:32 -06:00
bounce.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2015-09-19 18:57:09 -07:00
bsg-lib.c bsg: Remove unused function bsg_goose_queue() 2012-12-06 14:33:02 +01:00
bsg.c block: Simplify bsg complete all 2015-02-04 09:57:52 -07:00
cfq-iosched.c block: do not merge requests without consulting with io scheduler 2016-07-20 21:35:12 -06:00
cmdline-parser.c block: remove unrelated header files and export symbol 2014-01-21 20:18:26 -08:00
compat_ioctl.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
deadline-iosched.c block: do not merge requests without consulting with io scheduler 2016-07-20 21:35:12 -06:00
elevator.c block: do not merge requests without consulting with io scheduler 2016-07-20 21:35:12 -06:00
genhd.c block: fix use-after-free in seq file 2016-08-04 14:19:16 -06:00
ioctl.c DAX error handling for 4.7 2016-05-26 19:34:26 -07:00
ioprio.c block: fix use-after-free in sys_ioprio_get() 2016-07-01 08:39:24 -06:00
Kconfig block: remove BLK_DEV_DAX config option 2016-08-04 08:50:07 -04:00
Kconfig.iosched blkcg: make CONFIG_BLK_CGROUP bool 2012-03-06 21:27:21 +01:00
Makefile Initial roundup of 4.5 merge window patches 2016-01-23 18:45:06 -08:00
noop-iosched.c elevator: use list_{first,prev,next}_entry 2015-11-16 15:21:48 -07:00
partition-generic.c block/partition-generic.c: Remove a set-but-not-used variable 2016-06-14 09:09:15 -06:00
scsi_ioctl.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
t10-pi.c block: Consolidate static integrity profile properties 2015-10-21 14:42:38 -06:00