Hugh Dickins 30514bd2dd sbitmap: fix lockup while swapping
Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.

Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0.  Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):

sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault

See how the process-context sbitmap_queue_wake_up() has been interrupted,
after bringing wait_cnt down to 0 (and in this example, after doing its
wakeups), before advancing wake_index and refilling wait_cnt: an
interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.

I have almost no grasp of all the possible sbitmap races, and their
consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
so it is better if sbq_wake_ptr() skips on to the next ws in that case:
which fixes the lockup and shows no adverse consequence for me.
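
The skip described above can be sketched in a simplified, single-threaded
userspace model.  This is only an illustration under stated assumptions, not
the kernel code: struct model_ws, struct model_sbq, model_wake_ptr() and
MODEL_WAIT_QUEUES are hypothetical names, has_waiters stands in for
waitqueue_active(), and plain ints replace the kernel's atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_WAIT_QUEUES 8	/* assumed queue count for this model */

/* Hypothetical stand-in for one wait state. */
struct model_ws {
	bool has_waiters;	/* stands in for waitqueue_active() */
	int wait_cnt;		/* 0: a waker has not yet refilled the count */
};

/* Hypothetical stand-in for the queue that owns the wait states. */
struct model_sbq {
	struct model_ws ws[MODEL_WAIT_QUEUES];
	int wake_index;
};

/*
 * Return the next wait state worth waking, scanning from wake_index.
 * A queue with waiters but wait_cnt 0 is skipped, because a waker can
 * do nothing useful there until the count has been refilled.
 */
static struct model_ws *model_wake_ptr(struct model_sbq *sbq)
{
	int wake_index = sbq->wake_index;

	assert(wake_index >= 0 && wake_index < MODEL_WAIT_QUEUES);

	for (int i = 0; i < MODEL_WAIT_QUEUES; i++) {
		struct model_ws *ws = &sbq->ws[wake_index];

		if (ws->has_waiters && ws->wait_cnt != 0) {
			/* Remember where the scan ended up. */
			sbq->wake_index = wake_index;
			return ws;
		}
		wake_index = (wake_index + 1) % MODEL_WAIT_QUEUES;
	}

	/* No usable queue: the caller gives up instead of spinning. */
	return NULL;
}
```

In this model, a queue left at wait_cnt 0 by an interrupted waker is passed
over, so a nested caller either moves on to another queue or gets NULL back
and returns, rather than cycling on the same wait state.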

The check for wait_cnt being 0 is obviously racy, and ultimately can lead
to lost wakeups: for example, when there is only a single waitqueue with
waiters.  However, lost wakeups are unlikely to matter in these cases,
and a proper fix requires redesign (and benchmarking) of the batched
wakeup code: so let's plug the hole with this bandaid for now.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 17:58:17 -06:00