Commit Graph

798166 Commits

Author SHA1 Message Date
Jens Axboe
58ab5e32e6 sbitmap: silence bogus lockdep IRQ warning
Ming reports that lockdep spews the following trace. What this
essentially says is that the sbitmap swap_lock was used inconsistently
in IRQ enabled and disabled context, and that is usually indicative of a
bug that will cause a deadlock.

For this case, it's a false positive. The swap_lock is used from process
context only, when we swap the bits in the word and cleared mask. We
also end up doing that when we are getting a driver tag, from the
blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
IRQs disabled. However, this isn't from an actual IRQ, it's still
process context.

In lieu of a better way to fix this, simply always disable interrupts
when grabbing the swap_lock if lockdep is enabled.

[  100.967642] ================start test sanity/001================
[  101.238280] null: module loaded
[  106.093735]
[  106.094012] =====================================================
[  106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  106.095759] 4.20.0-rc3_5d2ee7122c73_for-next+ #1 Not tainted
[  106.096551] -----------------------------------------------------
[  106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[  106.098231] 000000004c43fa71
(&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
[  106.099431]
[  106.099431] and this task is already holding:
[  106.100229] 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.101630] which would create a new lock dependency:
[  106.102326]  (&(&hctx->dispatch_wait_lock)->rlock){....} ->
(&(&sb->map[i].swap_lock)->rlock){+.+.}
[  106.103553]
[  106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
[  106.104580]  (&sbq->ws[i].wait){..-.}
[  106.104582]
[  106.104582] ... which became SOFTIRQ-irq-safe at:
[  106.105751]   _raw_spin_lock_irqsave+0x4b/0x82
[  106.106284]   __wake_up_common_lock+0x119/0x1b9
[  106.106825]   sbitmap_queue_wake_up+0x33f/0x383
[  106.107456]   sbitmap_queue_clear+0x4c/0x9a
[  106.108046]   __blk_mq_free_request+0x188/0x1d3
[  106.108581]   blk_mq_free_request+0x23b/0x26b
[  106.109102]   scsi_end_request+0x345/0x5d7
[  106.109587]   scsi_io_completion+0x4b5/0x8f0
[  106.110099]   scsi_finish_command+0x412/0x456
[  106.110615]   scsi_softirq_done+0x23f/0x29b
[  106.111115]   blk_done_softirq+0x2a7/0x2e6
[  106.111608]   __do_softirq+0x360/0x6ad
[  106.112062]   run_ksoftirqd+0x2f/0x5b
[  106.112499]   smpboot_thread_fn+0x3a5/0x3db
[  106.113000]   kthread+0x1d4/0x1e4
[  106.113457]   ret_from_fork+0x3a/0x50
[  106.113969]
[  106.113969] to a SOFTIRQ-irq-unsafe lock:
[  106.114672]  (&(&sb->map[i].swap_lock)->rlock){+.+.}
[  106.114674]
[  106.114674] ... which became SOFTIRQ-irq-unsafe at:
[  106.116000] ...
[  106.116003]   _raw_spin_lock+0x33/0x64
[  106.116676]   sbitmap_get+0xd5/0x22c
[  106.117134]   __sbitmap_queue_get+0xe8/0x177
[  106.117731]   __blk_mq_get_tag+0x1e6/0x22d
[  106.118286]   blk_mq_get_tag+0x1db/0x6e4
[  106.118756]   blk_mq_get_driver_tag+0x161/0x258
[  106.119383]   blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.120043]   blk_mq_do_dispatch_sched+0x23a/0x287
[  106.120607]   blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.121234]   __blk_mq_run_hw_queue+0x137/0x17e
[  106.121781]   __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.122366]   blk_mq_run_hw_queue+0x151/0x187
[  106.122887]   blk_mq_sched_insert_requests+0x13f/0x175
[  106.123492]   blk_mq_flush_plug_list+0x7d6/0x81b
[  106.124042]   blk_flush_plug_list+0x392/0x3d7
[  106.124557]   blk_finish_plug+0x37/0x4f
[  106.125019]   read_pages+0x3ef/0x430
[  106.125446]   __do_page_cache_readahead+0x18e/0x2fc
[  106.126027]   force_page_cache_readahead+0x121/0x133
[  106.126621]   page_cache_sync_readahead+0x35f/0x3bb
[  106.127229]   generic_file_buffered_read+0x410/0x1860
[  106.127932]   __vfs_read+0x319/0x38f
[  106.128415]   vfs_read+0xd2/0x19a
[  106.128817]   ksys_read+0xb9/0x135
[  106.129225]   do_syscall_64+0x140/0x385
[  106.129684]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.130292]
[  106.130292] other info that might help us debug this:
[  106.130292]
[  106.131226] Chain exists of:
[  106.131226]   &sbq->ws[i].wait -->
&(&hctx->dispatch_wait_lock)->rlock -->
&(&sb->map[i].swap_lock)->rlock
[  106.131226]
[  106.132865]  Possible interrupt unsafe locking scenario:
[  106.132865]
[  106.133659]        CPU0                    CPU1
[  106.134194]        ----                    ----
[  106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
[  106.135318]                                local_irq_disable();
[  106.136014]                                lock(&sbq->ws[i].wait);
[  106.136747]
lock(&(&hctx->dispatch_wait_lock)->rlock);
[  106.137742]   <Interrupt>
[  106.138110]     lock(&sbq->ws[i].wait);
[  106.138625]
[  106.138625]  *** DEADLOCK ***
[  106.138625]
[  106.139430] 3 locks held by fio/1043:
[  106.139947]  #0: 0000000076ff0fd9 (rcu_read_lock){....}, at:
hctx_lock+0x29/0xe8
[  106.140813]  #1: 000000002feb1016 (&sbq->ws[i].wait){..-.}, at:
blk_mq_dispatch_rq_list+0x4ad/0xd7c
[  106.141877]  #2: 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.143267]
[  106.143267] the dependencies between SOFTIRQ-irq-safe lock and the
holding lock:
[  106.144351]  -> (&sbq->ws[i].wait){..-.} ops: 82 {
[  106.144926]     IN-SOFTIRQ-W at:
[  106.145314]                       _raw_spin_lock_irqsave+0x4b/0x82
[  106.146042]                       __wake_up_common_lock+0x119/0x1b9
[  106.146785]                       sbitmap_queue_wake_up+0x33f/0x383
[  106.147567]                       sbitmap_queue_clear+0x4c/0x9a
[  106.148379]                       __blk_mq_free_request+0x188/0x1d3
[  106.149148]                       blk_mq_free_request+0x23b/0x26b
[  106.149864]                       scsi_end_request+0x345/0x5d7
[  106.150546]                       scsi_io_completion+0x4b5/0x8f0
[  106.151367]                       scsi_finish_command+0x412/0x456
[  106.152157]                       scsi_softirq_done+0x23f/0x29b
[  106.152855]                       blk_done_softirq+0x2a7/0x2e6
[  106.153537]                       __do_softirq+0x360/0x6ad
[  106.154280]                       run_ksoftirqd+0x2f/0x5b
[  106.155020]                       smpboot_thread_fn+0x3a5/0x3db
[  106.155828]                       kthread+0x1d4/0x1e4
[  106.156526]                       ret_from_fork+0x3a/0x50
[  106.157267]     INITIAL USE at:
[  106.157713]                      _raw_spin_lock_irqsave+0x4b/0x82
[  106.158542]                      prepare_to_wait_exclusive+0xa8/0x215
[  106.159421]                      blk_mq_get_tag+0x34f/0x6e4
[  106.160186]                      blk_mq_get_request+0x48e/0xaef
[  106.160997]                      blk_mq_make_request+0x27e/0xbd2
[  106.161828]                      generic_make_request+0x4d1/0x873
[  106.162661]                      submit_bio+0x20c/0x253
[  106.163379]                      mpage_bio_submit+0x44/0x4b
[  106.164142]                      mpage_readpages+0x3c2/0x407
[  106.164919]                      read_pages+0x13a/0x430
[  106.165633]                      __do_page_cache_readahead+0x18e/0x2fc
[  106.166530]                      force_page_cache_readahead+0x121/0x133
[  106.167439]                      page_cache_sync_readahead+0x35f/0x3bb
[  106.168337]                      generic_file_buffered_read+0x410/0x1860
[  106.169255]                      __vfs_read+0x319/0x38f
[  106.169977]                      vfs_read+0xd2/0x19a
[  106.170662]                      ksys_read+0xb9/0x135
[  106.171356]                      do_syscall_64+0x140/0x385
[  106.172120]                      entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.173051]   }
[  106.173308]   ... key      at: [<ffffffff85094600>] __key.26481+0x0/0x40
[  106.174219]   ... acquired at:
[  106.174646]    _raw_spin_lock+0x33/0x64
[  106.175183]    blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.175843]    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.176518]    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.177262]    __blk_mq_run_hw_queue+0x137/0x17e
[  106.177900]    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.178591]    blk_mq_run_hw_queue+0x151/0x187
[  106.179207]    blk_mq_sched_insert_requests+0x13f/0x175
[  106.179926]    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.180571]    blk_flush_plug_list+0x392/0x3d7
[  106.181187]    blk_finish_plug+0x37/0x4f
[  106.181737]    __se_sys_io_submit+0x171/0x304
[  106.182346]    do_syscall_64+0x140/0x385
[  106.182895]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.183607]
[  106.183830] -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
[  106.184691]    INITIAL USE at:
[  106.185119]                    _raw_spin_lock+0x33/0x64
[  106.185838]                    blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.186697]                    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.187551]                    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.188481]                    __blk_mq_run_hw_queue+0x137/0x17e
[  106.189307]                    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.190189]                    blk_mq_run_hw_queue+0x151/0x187
[  106.190989]                    blk_mq_sched_insert_requests+0x13f/0x175
[  106.191902]                    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.192739]                    blk_flush_plug_list+0x392/0x3d7
[  106.193535]                    blk_finish_plug+0x37/0x4f
[  106.194269]                    __se_sys_io_submit+0x171/0x304
[  106.195059]                    do_syscall_64+0x140/0x385
[  106.195794]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.196705]  }
[  106.196950]  ... key      at: [<ffffffff84880620>] __key.51231+0x0/0x40
[  106.197853]  ... acquired at:
[  106.198270]    lock_acquire+0x280/0x2f3
[  106.198806]    _raw_spin_lock+0x33/0x64
[  106.199337]    sbitmap_get+0xd5/0x22c
[  106.199850]    __sbitmap_queue_get+0xe8/0x177
[  106.200450]    __blk_mq_get_tag+0x1e6/0x22d
[  106.201035]    blk_mq_get_tag+0x1db/0x6e4
[  106.201589]    blk_mq_get_driver_tag+0x161/0x258
[  106.202237]    blk_mq_dispatch_rq_list+0x5b9/0xd7c
[  106.202902]    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.203572]    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.204316]    __blk_mq_run_hw_queue+0x137/0x17e
[  106.204956]    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.205649]    blk_mq_run_hw_queue+0x151/0x187
[  106.206269]    blk_mq_sched_insert_requests+0x13f/0x175
[  106.206997]    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.207644]    blk_flush_plug_list+0x392/0x3d7
[  106.208264]    blk_finish_plug+0x37/0x4f
[  106.208814]    __se_sys_io_submit+0x171/0x304
[  106.209415]    do_syscall_64+0x140/0x385
[  106.209965]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.210684]
[  106.210904]
[  106.210904] the dependencies between the lock to be acquired
[  106.210905]  and SOFTIRQ-irq-unsafe lock:
[  106.212541] -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 1969 {
[  106.213393]    HARDIRQ-ON-W at:
[  106.213840]                     _raw_spin_lock+0x33/0x64
[  106.214570]                     sbitmap_get+0xd5/0x22c
[  106.215282]                     __sbitmap_queue_get+0xe8/0x177
[  106.216086]                     __blk_mq_get_tag+0x1e6/0x22d
[  106.216876]                     blk_mq_get_tag+0x1db/0x6e4
[  106.217627]                     blk_mq_get_driver_tag+0x161/0x258
[  106.218465]                     blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.219326]                     blk_mq_do_dispatch_sched+0x23a/0x287
[  106.220198]                     blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.221138]                     __blk_mq_run_hw_queue+0x137/0x17e
[  106.221975]                     __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.222874]                     blk_mq_run_hw_queue+0x151/0x187
[  106.223686]                     blk_mq_sched_insert_requests+0x13f/0x175
[  106.224597]                     blk_mq_flush_plug_list+0x7d6/0x81b
[  106.225444]                     blk_flush_plug_list+0x392/0x3d7
[  106.226255]                     blk_finish_plug+0x37/0x4f
[  106.227006]                     read_pages+0x3ef/0x430
[  106.227717]                     __do_page_cache_readahead+0x18e/0x2fc
[  106.228595]                     force_page_cache_readahead+0x121/0x133
[  106.229491]                     page_cache_sync_readahead+0x35f/0x3bb
[  106.230373]                     generic_file_buffered_read+0x410/0x1860
[  106.231277]                     __vfs_read+0x319/0x38f
[  106.231986]                     vfs_read+0xd2/0x19a
[  106.232666]                     ksys_read+0xb9/0x135
[  106.233350]                     do_syscall_64+0x140/0x385
[  106.234097]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.235012]    SOFTIRQ-ON-W at:
[  106.235460]                     _raw_spin_lock+0x33/0x64
[  106.236195]                     sbitmap_get+0xd5/0x22c
[  106.236913]                     __sbitmap_queue_get+0xe8/0x177
[  106.237715]                     __blk_mq_get_tag+0x1e6/0x22d
[  106.238488]                     blk_mq_get_tag+0x1db/0x6e4
[  106.239244]                     blk_mq_get_driver_tag+0x161/0x258
[  106.240079]                     blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.240937]                     blk_mq_do_dispatch_sched+0x23a/0x287
[  106.241806]                     blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.242751]                     __blk_mq_run_hw_queue+0x137/0x17e
[  106.243579]                     __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.244469]                     blk_mq_run_hw_queue+0x151/0x187
[  106.245277]                     blk_mq_sched_insert_requests+0x13f/0x175
[  106.246191]                     blk_mq_flush_plug_list+0x7d6/0x81b
[  106.247044]                     blk_flush_plug_list+0x392/0x3d7
[  106.247859]                     blk_finish_plug+0x37/0x4f
[  106.248749]                     read_pages+0x3ef/0x430
[  106.249463]                     __do_page_cache_readahead+0x18e/0x2fc
[  106.250357]                     force_page_cache_readahead+0x121/0x133
[  106.251263]                     page_cache_sync_readahead+0x35f/0x3bb
[  106.252157]                     generic_file_buffered_read+0x410/0x1860
[  106.253084]                     __vfs_read+0x319/0x38f
[  106.253808]                     vfs_read+0xd2/0x19a
[  106.254488]                     ksys_read+0xb9/0x135
[  106.255186]                     do_syscall_64+0x140/0x385
[  106.255943]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.256867]    INITIAL USE at:
[  106.257300]                    _raw_spin_lock+0x33/0x64
[  106.258033]                    sbitmap_get+0xd5/0x22c
[  106.258747]                    __sbitmap_queue_get+0xe8/0x177
[  106.259542]                    __blk_mq_get_tag+0x1e6/0x22d
[  106.260320]                    blk_mq_get_tag+0x1db/0x6e4
[  106.261072]                    blk_mq_get_driver_tag+0x161/0x258
[  106.261902]                    blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.262762]                    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.263626]                    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.264571]                    __blk_mq_run_hw_queue+0x137/0x17e
[  106.265409]                    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.266302]                    blk_mq_run_hw_queue+0x151/0x187
[  106.267111]                    blk_mq_sched_insert_requests+0x13f/0x175
[  106.268028]                    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.268878]                    blk_flush_plug_list+0x392/0x3d7
[  106.269694]                    blk_finish_plug+0x37/0x4f
[  106.270432]                    read_pages+0x3ef/0x430
[  106.271139]                    __do_page_cache_readahead+0x18e/0x2fc
[  106.272040]                    force_page_cache_readahead+0x121/0x133
[  106.272932]                    page_cache_sync_readahead+0x35f/0x3bb
[  106.273811]                    generic_file_buffered_read+0x410/0x1860
[  106.274709]                    __vfs_read+0x319/0x38f
[  106.275407]                    vfs_read+0xd2/0x19a
[  106.276074]                    ksys_read+0xb9/0x135
[  106.276764]                    do_syscall_64+0x140/0x385
[  106.277500]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.278417]  }
[  106.278676]  ... key      at: [<ffffffff85094640>] __key.26212+0x0/0x40
[  106.279586]  ... acquired at:
[  106.280026]    lock_acquire+0x280/0x2f3
[  106.280559]    _raw_spin_lock+0x33/0x64
[  106.281101]    sbitmap_get+0xd5/0x22c
[  106.281610]    __sbitmap_queue_get+0xe8/0x177
[  106.282221]    __blk_mq_get_tag+0x1e6/0x22d
[  106.282809]    blk_mq_get_tag+0x1db/0x6e4
[  106.283368]    blk_mq_get_driver_tag+0x161/0x258
[  106.284018]    blk_mq_dispatch_rq_list+0x5b9/0xd7c
[  106.284685]    blk_mq_do_dispatch_sched+0x23a/0x287
[  106.285371]    blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.286135]    __blk_mq_run_hw_queue+0x137/0x17e
[  106.286806]    __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.287515]    blk_mq_run_hw_queue+0x151/0x187
[  106.288149]    blk_mq_sched_insert_requests+0x13f/0x175
[  106.289041]    blk_mq_flush_plug_list+0x7d6/0x81b
[  106.289912]    blk_flush_plug_list+0x392/0x3d7
[  106.290590]    blk_finish_plug+0x37/0x4f
[  106.291238]    __se_sys_io_submit+0x171/0x304
[  106.291864]    do_syscall_64+0x140/0x385
[  106.292534]    entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-09 17:43:20 -07:00
Dan Carpenter
29cadd2bb6 scsi: Fix a harmless double shift bug
Smatch generates a warning:

    drivers/scsi/scsi_lib.c:1656 scsi_mq_done() warn: test_bit() takes a bit number

The problem is that SCMD_STATE_COMPLETE is supposed to be bit number 0
and not a mask like "(1 << 0)".  It is used like this:

        if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))

The test_and_set_bit() has a shift built in so it's a double left shift
and uses bit number 1 instead of number 0.  This bug is harmless because
it's done consistently and it doesn't clash with any other flags.

Fixes: f1342709d1 ("scsi: Do not rely on blk-mq for double completions")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:27:01 -07:00
Israel Rukshin
3236b458c4 nvme: remove unused function nvme_ctrl_ready
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:58 -07:00
Keith Busch
49cd84b6f8 nvme: implement Enhanced Command Retry
A controller may have an internal state that is not able to successfully
process commands for a short duration. In such states, an immediate
command requeue is expected to fail. The driver may exceed its max
retry count, which permanently ends the command in failure when the same
command would succeed after waiting for the controller to be ready.

NVMe ratified TP 4033 provides a delay hint in the completion status
code for failed commands. Implement the retry delay based on the command
completion status and the controller's requested delay.

Note that requeued commands are handled per request_queue, not per
individual request. If multiple commands fail, the controller should
consistently report the desired delay time for retryable commands in
all CQEs, otherwise the requeue list may be kicked too soon.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:58 -07:00
Chaitanya Kulkarni
5a3a6d6965 nvmet: fix the structure member indentation
This is a cleanup patch which fixes the structure member indentation
introduced by the p2p:

commit c6925093d0 ("nvmet: Optionally use PCI P2P memory").
We don't change any functionality in this patch.

This is needed so that any future members will also follow the uniform
indentation.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:58 -07:00
Chaitanya Kulkarni
cb019da3da nvmet: use unlikely for req status check
This patch adds unlikely in the nvmet request completion path for the
status check in the low level function __nvmet_req_complete.
This is helpful in the scenario where host and target connection is
working smoothly.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:58 -07:00
Israel Rukshin
ad1f824948 nvmet-rdma: Add unlikely for response allocated check
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:58 -07:00
Israel Rukshin
5c4072ad1c nvme: Remove unused forward declaration
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Sagi Grimberg
8154ed730b nvme: disable fabrics SQ flow control when asked by the user
As for now, we don't care about sq_head pointer updates anyway, so
at least allow the controller to micro-optimize by omiting this update.

Note that we will probably need to support it when a controller
that requires this comes along.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Sagi Grimberg
9b95d2fb85 nvmet: expose support for fabrics SQ flow control disable in treq
Technical Proposal introduces an indication for SQ flow control
disable support. Expose it since we are able to operate in this mode.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Sagi Grimberg
0445e1b5a2 nvmet: don't override treq upon modification.
Only override the allowed parts of it.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[hch: slight tweak to the NVME_TREQ_SECURE_CHANNEL_MASK definition]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Sagi Grimberg
e6a622fd6d nvmet: support fabrics sq flow control
Technical proposal 8005 "fabrics SQ flow control" introduces a mode
where a host and controller agree to omit sq_head pointer updates
when sending nvme completions.

In case the host indicated desire to operate in this mode (connect attribute)
the controller will return back a connect completion with sq_head value
of 0xffff as indication that it will omit sq_head pointer updates.

This mode saves us an atomic update in the I/O path.

Reviewed-by: Hannes Reinecke <hare@suse.com>
[hch: suggested better implementation]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
James Smart
6e2e312ea7 nvmet-fc: remove the IN_ISR deferred scheduling options
All target lldd's call the cmd receive and op completions in non-isr
thread contexts. As such the IN_ISR options are not necessary.
Remove the functionality and flags, which also removes cpu assignments
to queues.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Christoph Hellwig
03198c4d9f nvmet: mark nvmet_genctr static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Jay Sternberg
b662a07857 nvmet: enable Discovery Controller AENs
Add functions to find connections requesting Discovery Change events
and send a notification to hosts that maintain an explicit persistent
connection and have and active Asynchronous Event Request pending.
Only Hosts that have access to the Subsystem effected by the change
will receive notifications of Discovery Change event.

Call these functions each time there is a configfs change that effects
the Discover Log Pages.

Set the OAES field in the Identify Controller response to advertise the
support for Asynchronous Event Notifications.

Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Phil Cayton <phil.cayton@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Sagi Grimberg
253928eec6 nvmet: allow host connect even if no allowed subsystems are exported
It is perfectly valid that a host connects to a discovery subsystem
and gets an empty discovery log page since no subsystems are
provisioned to it. No reason to disallow connecting to the discovery
subsystem all together.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Phil Cayton <phil.cayton@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Jay Sternberg
6a8ec0ac5e nvmet: add support to Discovery controllers for commands
Add custom get/set features to commands allowed by Discovery controllers.

Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Jay Sternberg
f301c2b136 nvmet: add defines for discovery change async events
Add AEN/AER values as defined by the specification

Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Jay Sternberg
90107455cc nvmet: make kato and AEN processing for use by other controllers
Make common process of get/set features available to other controllers by
making simple functions static inline and others not static and prototypes
in nvmet.h file

Also remove static from nvmet_execute_async_event and add prototype to
nvmet.h to allow used by other controllers

Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Jay Sternberg
f9362ac173 nvmet: allow Keep Alive for Discovery controller
Per change to specification allowing Discovery controllers to have
explicit persistent connections, remove restriction on Discovery
controllers allowing kato on connect.

Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Jay Sternberg
7114ddeb40 nvmet: change aen mask functions to use bit numbers
Functions nvmet_aen_disabled and nvmet_clear_aen were using
values not bit numbers ie 1 << 9 not 9 for bit function clear_bit
and test_and_set_bit.

Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Phil Cayton <phil.cayton@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Jay Sternberg
6c8312ad50 nvmet: provide aen bit functions for multiple controller types
Move nvmet_aen_disabled and nvmet_clear_aen in preparation for other types
of controllers to use, initially the discovery controller.

Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Chaitanya Kulkarni
50a909db36 nvmet: use IOCB_NOWAIT for file-ns buffered I/O
This patch optimizes read command behavior when file-ns configured
with buffered I/O. Instead of offloading the buffered I/O read operations
to the worker threads, we first issue the read operation with IOCB_NOWAIT
and try and access the data from the cache. Here we only offload the
request to the worker thread and complete the request in the worker
thread context when IOCB_NOWAIT request fails.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Sagi Grimberg
c09305ae49 nvmet: support for traffic based keep-alive
A controller that supports traffic based keep-alive can restart the keep
alive timer even when no keep-alive was not received in the kato period
as long as other admin or I/O commands were received.  For each command
set ctrl->cmd_seen to true, and when keep-alive timer expires, if any
commands were seen, resched ka_work instead of escalating to a fatal
error.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:56 -07:00
Sagi Grimberg
6e3ca03ee9 nvme: support traffic based keep-alive
If the controller supports traffic based keep alive, we restart the keep
alive timer if any admin or io commands was completed during the kato
period.  This prevents a possible starvation of keep alive commands in
the presence of heavy traffic as in such case, we already have a health
indication from the host perspective.

Only set a comp_seen indicator in case the controller supports keep
alive to minimize the overhead for pci controllers.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Sagi Grimberg
3e53ba38a9 nvme: cache controller attributes
We get the controller attributes in identify, cache them as we'll need
them for traffic based keep alive support.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Sagi Grimberg
12b2117161 nvme: introduce ctrl attributes enumeration
We are growing more controller attributes, so use a proper enumeration
for it.  For now just add the 128-bit hostid which we support.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Hannes Reinecke
103e515efa nvme: add a numa_node field to struct nvme_ctrl
Instead of directly poking into the struct device add a new numa_node
field to struct nvme_ctrl.  This allows fabrics drivers where ctrl->dev
is a virtual device to support NUMA affinity as well.

Also expose the field as a sysfs attribute, and populate it for the
RDMA and FC transports.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Chaitanya Kulkarni
1190203555 nvme: consolidate memset calls in the nvme_setup_cmd path
In function nvme_setup_cmd() we call command specific setup function
for flush, rw, and discard. Instead of calling memset in each function
lets call it once in the parent function.

This is purely code cleanup patch and it does not change any existing
functionality.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Ming Lei
5938870247 blk-mq: re-build queue map in case of kdump kernel
Now almost all .map_queues() implementation based on managed irq
affinity doesn't update queue mapping and it just retrieves the
old built mapping, so if nr_hw_queues is changed, the mapping talbe
includes stale mapping. And only blk_mq_map_queues() may rebuild
the mapping talbe.

One case is that we limit .nr_hw_queues as 1 in case of kdump kernel.
However, drivers often builds queue mapping before allocating tagset
via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set
as 1 in case of kdump kernel, so wrong queue mapping is used, and
kernel panic[1] is observed during booting.

This patch fixes the kernel panic triggerd on nvme by rebulding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[    4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[    4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[    4.444681] PGD 0 P4D 0
[    4.445367] Oops: 0000 [#1] SMP NOPTI
[    4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[    4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[    4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[    4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[    4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[    4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
[    4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
[    4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
[    4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
[    4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
[    4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
[    4.469220] FS:  0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
[    4.471554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
[    4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    4.477061] PKRU: 55555554
[    4.477464] Call Trace:
[    4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
[    4.479595]  blk_mq_init_queue+0x32/0x4e
[    4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
[    4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[    4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[    4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
[    4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
[    4.483930]  ? try_to_wake_up+0x38d/0x3b3
[    4.484478]  ? process_one_work+0x179/0x2fc
[    4.485118]  process_one_work+0x1d3/0x2fc
[    4.485655]  ? rescuer_thread+0x2ae/0x2ae
[    4.486196]  worker_thread+0x1e9/0x2be
[    4.486841]  kthread+0x115/0x11d
[    4.487294]  ? kthread_park+0x76/0x76
[    4.487784]  ret_from_fork+0x3a/0x50
[    4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[    4.489428] Dumping ftrace buffer:
[    4.489939]    (ftrace buffer empty)
[    4.490492] CR2: 0000000000000098
[    4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Cc: David Milburn <dmilburn@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Dennis Zhou
4705de735b blkcg: put back rcu lock in blkcg_bio_issue_check()
I was a little overzealous in removing the rcu_read_lock() call from
blkcg_bio_issue_check() and it broke blk-throttle. Put it back.

Fixes: e35403a034bf ("blkcg: associate blkg when associating a device")
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Josef Bacik
d3fcdff190 block: convert io-latency to use rq_qos_wait
Now that we have this common helper, convert io-latency over to use it
as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Josef Bacik
b6c7b58f5f block: convert wbt_wait() to use rq_qos_wait()
Now that we have rq_qos_wait() in place, convert wbt_wait() over to
using it with it's specific callbacks.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Josef Bacik
84f603246d block: add rq_qos_wait to rq_qos
Originally when I split out the common code from blk-wbt into rq_qos I
left the wbt_wait() where it was and simply copied and modified it
slightly to work for io-latency.  However they are both basically the
same thing, and as time has gone on wbt_wait() has ended up much smarter
and kinder than it was when I copied it into io-latency, which means
io-latency has lost out on these improvements.

Since they are the same thing essentially except for a few minor things,
create rq_qos_wait() that replicates what wbt_wait() currently does with
callbacks that can be passed in for the snowflakes to do their own thing
as appropriate.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Dennis Zhou
7754f669ff blkcg: rename blkg_try_get() to blkg_tryget()
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get to
blkg_tryget and now returns a bool rather than the blkg or %NULL.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Dennis Zhou
7fcf2b033b blkcg: change blkg reference counting to use percpu_ref
Every bio is now associated with a blkg putting blkg_get, blkg_try_get,
and blkg_put on the hot path. Switch over the refcnt in blkg to use
percpu_ref.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Dennis Zhou
6f70fb6618 blkcg: remove bio_disassociate_task()
Now that a bio only holds a blkg reference, so clean up is simply
putting back that reference. Remove bio_disassociate_task() as it just
calls bio_disassociate_blkg() and call the latter directly.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:38 -07:00
Dennis Zhou
fc5a828bfa blkcg: remove additional reference to the css
The previous patch in this series removed carrying around a pointer to
the css in blkg. However, the blkg association logic still relied on
taking a reference on the css to ensure we wouldn't fail in getting a
reference for the blkg.

Here the implicit dependency on the css is removed. The association
continues to rely on the tryget logic walking up the blkg tree. This
streamlines the three ways that association can happen: normal, swap,
and writeback.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
db6638d7d1 blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that any bio that interacts with a request_queue
is properly associated with a blkg. This makes bio->bi_css unnecessary
as blkg maintains a reference to blkcg already.

This removes the bio field bi_css and transfers corresponding uses to
access via bi_blkg.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
fd42df305f blkcg: associate writeback bios with a blkg
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg(). In
this patch, wbc_init_bio() now requires a bio to have a device
associated with it.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
6a7f6d86a5 blkcg: associate a blkg for pages being evicted by swap
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
e439bedf6b blkcg: consolidate bio_issue_init() to be a part of core
bio_issue_init among other things initializes the timestamp for an IO.
Rather than have this logic handled by policies, this consolidates it to
be on the init paths (normal, clone, bounce clone).

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
5cdf2e3fea blkcg: associate blkg when associating a device
Previously, blkg association was handled by controller specific code in
blk-throttle and blk-iolatency. However, because a blkg represents a
relationship between a blkcg and a request_queue, it makes sense to keep
the blkg->q and bio->bi_disk->queue consistent.

This patch moves association into the bio_set_dev macro(). This should
cover the majority of cases where the device is set/changed keeping the
two pointers consistent. Fallback code is added to
blkcg_bio_issue_check() to catch any missing paths.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
892ad71f62 dm: set the static flush bio device on demand
The next patch changes the macro bio_set_dev() to associate a bio with a
blkg based on the device set. However, dm creates a static bio to be
used as the basis for cloning empty flush bios on creation. The
bio_set_dev() call in alloc_dev() will cause problems with the next
patch adding association to bio_set_dev() because the call is before the
bdev is associated with a gendisk (bd_disk is %NULL). To get around
this, set the device on the static bio every time and use that to clone
to the other bios.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
2268c0feb0 blkcg: introduce common blkg association logic
There are 3 ways blkg association can happen: association with the
current css, with the page css (swap), or from the wbc css (writeback).

This patch handles how association is done for the first case where we
are associating bsaed on the current css. If there is already a blkg
associated, the css will be reused and association will be redone as the
request_queue may have changed.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:36 -07:00
Dennis Zhou
beea9da07d blkcg: convert blkg_lookup_create() to find closest blkg
There are several scenarios where blkg_lookup_create() can fail such as
the blkcg dying, request_queue is dying, or simply being OOM. Most
handle this by simply falling back to the q->root_blkg and calling it a
day.

This patch implements the notion of closest blkg. During
blkg_lookup_create(), if it fails to create, return the closest blkg
found or the q->root_blkg. blkg_try_get_closest() is introduced and used
during association so a bio is always attached to a blkg.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:36 -07:00
Dennis Zhou
b978962ad4 blkcg: update blkg_lookup_create() to do locking
To know when to create a blkg, the general pattern is to do a
blkg_lookup() and if that fails, lock and do the lookup again, and if
that fails finally create. It doesn't make much sense for everyone who
wants to do creation to write this themselves.

This changes blkg_lookup_create() to do locking and implement this
pattern. The old blkg_lookup_create() is renamed to
__blkg_lookup_create().  If a call site wants to do its own error
handling or already owns the queue lock, they can use
__blkg_lookup_create(). This will be used in upcoming patches.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:36 -07:00
Dennis Zhou
0fe061b9f0 blkcg: fix ref count issue with bio_blkcg() using task_css
The bio_blkcg() function turns out to be inconsistent and consequently
dangerous to use. The first part returns a blkcg where a reference is
owned by the bio meaning it does not need to be rcu protected. However,
the third case, the last line, is problematic:

	return css_to_blkcg(task_css(current, io_cgrp_id));

This can race against task migration and the cgroup dying. It is also
semantically different as it must be called rcu protected and is
susceptible to failure when trying to get a reference to it.

This patch adds association ahead of calling bio_blkcg() rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg(). In blk-iolatency, association is
moved above the bio_blkcg() call to ensure it will not return %NULL.

BFQ uses the old bio_blkcg() function, but I do not want to address it
in this series due to the complexity. I have created a private version
documenting the inconsistency and noting not to use it.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:36 -07:00
Jens Axboe
6e0de61107 blk-mq: remove QUEUE_FLAG_POLL from default MQ flags
We only support polling if we have poll queues now, but the flag is
being set by default. Remove the default QUEUE_FLAG_POLL setting, we'll
set it in blk_mq_init_allocated_queue() if we have poll queues available
for this device.

Fixes: 6544d229bf ("block: enable polling by default if a poll map is initalized")
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:36 -07:00
Christoph Hellwig
6544d229bf block: enable polling by default if a poll map is initalized
If the user did setup polling in the driver we should not require
another know in the block layer to enable it.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04 11:38:19 -07:00