Commit Graph

6048 Commits

Author SHA1 Message Date
0a5aa8d161 block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection
Commit 9d497e2941 ("block: don't protect submit_bio_checks by
q_usage_counter") moved blk_mq_attempt_bio_merge and rq_qos_throttle
calls out of q_usage_counter protection. However, these functions require
q_usage_counter protection. The blk_mq_attempt_bio_merge call without
the protection resulted in blktests block/005 failure with KASAN null-
ptr-deref or use-after-free at bio merge. The rq_qos_throttle call
without the protection caused kernel hang at qos throttle.

To fix the failures, move the blk_mq_attempt_bio_merge and
rq_qos_throttle calls back to q_usage_counter protection.

Fixes: 9d497e2941 ("block: don't protect submit_bio_checks by q_usage_counter")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20220308080915.3473689-1-shinichiro.kawasaki@wdc.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08 17:48:39 -07:00
bb49c6fa8b block: clear iocb->private in blkdev_bio_end_io_async()
iocb_bio_iopoll() expects iocb->private to be cleared before
releasing the bio.

We already do this in blkdev_bio_end_io(), but we forgot in the
recently added blkdev_bio_end_io_async().

Fixes: 54a88eb838 ("block: add single bio async direct IO helper")
Cc: asml.silence@gmail.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220211090136.44471-1-sgarzare@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-22 06:59:49 -07:00
e92bc4cd34 block/wbt: fix negative inflight counter when remove scsi device
Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
wbt_disable_default() when switch elevator to bfq. And when
we remove scsi device, wbt will be enabled by wbt_enable_default.
If it become false positive between wbt_wait() and wbt_track()
when submit write request.

The following is the scenario that triggered the problem.

T1                          T2                           T3
                            elevator_switch_mq
                            bfq_init_queue
                            wbt_disable_default <= Set
                            rwb->enable_state (OFF)
Submit_bio
blk_mq_make_request
rq_qos_throttle
<= rwb->enable_state (OFF)
                                                         scsi_remove_device
                                                         sd_remove
                                                         del_gendisk
                                                         blk_unregister_queue
                                                         elv_unregister_queue
                                                         wbt_enable_default
                                                         <= Set rwb->enable_state (ON)
q_qos_track
<= rwb->enable_state (ON)
^^^^^^ this request will mark WBT_TRACKED without inflight add and will
lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.

Fix this by move wbt_enable_default() from elv_unregister to
bfq_exit_queue(). Only re-enable wbt when bfq exit.

Fixes: 76a8040817 ("blk-wbt: make sure throttle is enabled properly")

Remove oneline stale comment, and kill one oneshot local variable.

Signed-off-by: Ming Lei <ming.lei@rehdat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17 07:54:03 -07:00
7a5428dcb7 block: fix surprise removal for drivers calling blk_set_queue_dying
Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9eb8 that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.

Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.

Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17 07:54:03 -07:00
cc8f7fe1f5 block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern
Add __GFP_ZERO flag for alloc_page in function bio_copy_kern to initialize
the buffer of a bio.

Signed-off-by: Haimin Zhang <tcs.kernel@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220216084038.15635-1-tcs.kernel@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17 07:54:03 -07:00
a12821d5e0 block: Add handling for zone append command in blk_complete_request
Zone append command needs special handling to update the bi_sector
field in the bio struct with the actual position of the data in the
device. It is stored in __sector field of the request struct.

Fixes: 5581a5ddfe ("block: add completion handler for fast path")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220211093425.43262-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-11 09:56:13 -07:00
b13e0c7185 block: bio-integrity: Advance seed correctly for larger interval sizes
Commit 309a62fa3a ("bio-integrity: bio_integrity_advance must update
integrity seed") added code to update the integrity seed value when
advancing a bio. However, it failed to take into account that the
integrity interval might be larger than the 512-byte block layer
sector size. This broke bio splitting on PI devices with 4KB logical
blocks.

The seed value should be advanced by bio_integrity_intervals() and not
the number of sectors.

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
Fixes: 309a62fa3a ("bio-integrity: bio_integrity_advance must update integrity seed")
Tested-by: Dmitry Ivanov <dmitry.ivanov2@hpe.com>
Reported-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220204034209.4193-1-martin.petersen@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-03 21:09:24 -07:00
3e1f941dd9 block: fix DIO handling regressions in blkdev_read_iter()
Commit ceaa762527 ("block: move direct_IO into our own read_iter
handler") introduced several regressions for bdev DIO:

1. read spanning EOF always returns 0 instead of the number of bytes
   read.  This is because "count" is assigned early and isn't updated
   when the iterator is truncated:

     $ lsblk -o name,size /dev/vdb
     NAME SIZE
     vdb    1G
     $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb
     read 0/4194304 bytes at offset 1070596096
     0.000000 bytes, 0 ops; 0.0007 sec (0.000000 bytes/sec and 0.0000 ops/sec)

     instead of

     $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb
     read 3145728/4194304 bytes at offset 1070596096
     3 MiB, 1 ops; 0.0007 sec (3.865 GiB/sec and 1319.2612 ops/sec)

2. truncated iterator isn't reexpanded
3. iterator isn't reverted on blkdev_direct_IO() error
4. zero size read no longer skips atime update

Fixes: ceaa762527 ("block: move direct_IO into our own read_iter handler")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220201100420.25875-1-idryomov@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:48:27 -07:00
e45c47d1f9 block: add bio_start_io_acct_time() to control start_time
bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-28 12:28:15 -07:00
592ee1197f blk-mq: fix missing blk_account_io_done() in error path
If blk_mq_request_issue_directly() failed from
blk_insert_cloned_request(), the request will be accounted start.
Currently, blk_insert_cloned_request() is only called by dm, and such
request won't be accounted done by dm.

In normal path, io will be accounted start from blk_mq_bio_to_request(),
when the request is allocated, and such io will be accounted done from
__blk_mq_end_request_acct() whether it succeeded or failed. Thus add
blk_account_io_done() to fix the problem.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220126012132.3111551-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-26 06:34:41 -07:00
83114df32a block: fix memory leak in disk_register_independent_access_ranges
kobject_init_and_add() takes reference even when it fails.
According to the doc of kobject_init_and_add()

   If this function returns an error, kobject_put() must be called to
   properly clean up the memory associated with the object.

Fix this issue by adding kobject_put().
Callback function blk_ia_ranges_sysfs_release() in kobject_put()
can handle the pointer "iars" properly.

Fixes: a2247f19ee ("block: Add independent access ranges support")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220120101025.22411-1-linmq006@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-23 09:13:09 -07:00
3689f9f8b0 Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linux
Pull bitmap updates from Yury Norov:

 - introduce for_each_set_bitrange()

 - use find_first_*_bit() instead of find_next_*_bit() where possible

 - unify for_each_bit() macros

* tag 'bitmap-5.17-rc1' of git://github.com/norov/linux:
  vsprintf: rework bitmap_list_string
  lib: bitmap: add performance test for bitmap_print_to_pagebuf
  bitmap: unify find_bit operations
  mm/percpu: micro-optimize pcpu_is_populated()
  Replace for_each_*_bit_from() with for_each_*_bit() where appropriate
  find: micro-optimize for_each_{set,clear}_bit()
  include/linux: move for_each_bit() macros from bitops.h to find.h
  cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
  tools: sync tools/bitmap with mother linux
  all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
  cpumask: use find_first_and_bit()
  lib: add find_first_and_bit()
  arch: remove GENERIC_FIND_FIRST_BIT entirely
  include: move find.h from asm_generic to linux
  bitops: move find_bit_*_le functions from le.h to find.h
  bitops: protect find_first_{,zero}_bit properly
2022-01-23 06:20:44 +02:00
0a4ee51818 mm: remove cleancache
Patch series "remove Xen tmem leftovers".

Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap.  This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.

This patch (of 13):

The cleancache subsystem is unused since the removal of Xen tmem driver
in commit 814bbf49dc ("xen: remove tmem driver").

[akpm@linux-foundation.org: remove now-unreachable code]

Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-22 08:33:38 +02:00
3c7c25038b Merge tag 'block-5.17-2022-01-21' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Various little minor fixes that should go into this release:

   - Fix issue with cloned bios and IO accounting (Christoph)

   - Remove redundant assignments (Colin, GuoYong)

   - Fix an issue with the mq-deadline async_depth sysfs interface (me)

   - Fix brd module loading race (Tetsuo)

   - Shared tag map wakeup fix (Laibin)

   - End of bdev read fix (OGAWA)

   - srcu leak fix (Ming)"

* tag 'block-5.17-2022-01-21' of git://git.kernel.dk/linux-block:
  block: fix async_depth sysfs interface for mq-deadline
  block: Fix wrong offset in bio_truncate()
  block: assign bi_bdev for cloned bios in blk_rq_prep_clone
  block: cleanup q->srcu
  block: Remove unnecessary variable assignment
  brd: remove brd_devices_mutex mutex
  aoe: remove redundant assignment on variable n
  loop: remove redundant initialization of pointer node
  blk-mq: fix tag_get wait task can't be awakened
2022-01-21 16:17:03 +02:00
46cdc45acb block: fix async_depth sysfs interface for mq-deadline
A previous commit added this feature, but it inadvertently used the wrong
variable to show/store the setting from/to, victimized by copy/paste. Fix
it up so that the async_depth sysfs interface reads and writes from the
right setting.

Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-20 10:54:02 -07:00
3ee859e384 block: Fix wrong offset in bio_truncate()
bio_truncate() clears the buffer outside of last block of bdev, however
current bio_truncate() is using the wrong offset of page. So it can
return the uninitialized data.

This happened when both of truncated/corrupted FS and userspace (via
bdev) are trying to read the last of bdev.

Reported-by: syzbot+ac94ae5f68b84197f41c@syzkaller.appspotmail.com
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/875yqt1c9g.fsf@mail.parknet.co.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-20 06:30:12 -07:00
fd9f4e62a3 block: assign bi_bdev for cloned bios in blk_rq_prep_clone
bio_clone_fast() sets the cloned bio to have the same ->bi_bdev as the
source bio. This means that when request-based dm called setup_clone(),
the cloned bio had its ->bi_bdev pointing to the dm device. After Commit
0b6e522cdc ("blk-mq: use ->bi_bdev for I/O accounting")
__blk_account_io_start() started using the request's ->bio->bi_bdev for
I/O accounting, if it was set. This caused IO going to the underlying
devices to use the dm device for their I/O accounting.

Set up the proper ->bi_bdev in blk_rq_prep_clone based on the whole
device bdev for the queue the request is cloned onto.

Fixes: 0b6e522cdc ("blk-mq: use ->bi_bdev for I/O accounting")
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[hch: the commit message is mostly from a different patch from Benjamin]
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Link: https://lore.kernel.org/r/20220118070444.1241739-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-18 06:34:05 -07:00
850fd2abbe block: cleanup q->srcu
srcu structure has to be cleanup via cleanup_srcu_struct(), so fix it.

Reported-by: syzbot+4f789823c1abc5accf13@syzkaller.appspotmail.com
Fixes: 704b914f15 ("blk-mq: move srcu from blk_mq_hw_ctx to request_queue")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220111123401.520192-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-17 07:24:45 -07:00
e6a2e5116e block: Remove unnecessary variable assignment
The parameter "ret" should be zero when running to this line,
no need to set to zero again, remove it.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1642414957-6785-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-17 07:24:04 -07:00
9b51d9d866 cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
cpumask_first() is a more effective analogue of 'next' version if n == -1
(which means start == 0). This patch replaces 'next' with 'first' where
things look trivial.

There's no cpumask_first_zero() function, so create it.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2022-01-15 08:47:31 -08:00
e1a7aa25ff Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
 "This series consists of the usual driver updates (ufs, pm80xx, lpfc,
  mpi3mr, mpt3sas, hisi_sas, libsas) and minor updates and bug fixes.

  The most impactful change is likely the switch from GFP_DMA to
  GFP_KERNEL in a bunch of drivers, but even that shouldn't affect too
  many people"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (121 commits)
  scsi: mpi3mr: Bump driver version to 8.0.0.61.0
  scsi: mpi3mr: Fixes around reply request queues
  scsi: mpi3mr: Enhanced Task Management Support Reply handling
  scsi: mpi3mr: Use TM response codes from MPI3 headers
  scsi: mpi3mr: Add io_uring interface support in I/O-polled mode
  scsi: mpi3mr: Print cable mngnt and temp threshold events
  scsi: mpi3mr: Support Prepare for Reset event
  scsi: mpi3mr: Add Event acknowledgment logic
  scsi: mpi3mr: Gracefully handle online FW update operation
  scsi: mpi3mr: Detect async reset that occurred in firmware
  scsi: mpi3mr: Add IOC reinit function
  scsi: mpi3mr: Handle offline FW activation in graceful manner
  scsi: mpi3mr: Code refactor of IOC init - part2
  scsi: mpi3mr: Code refactor of IOC init - part1
  scsi: mpi3mr: Fault IOC when internal command gets timeout
  scsi: mpi3mr: Display IOC firmware package version
  scsi: mpi3mr: Handle unaligned PLL in unmap cmnds
  scsi: mpi3mr: Increase internal cmnds timeout to 60s
  scsi: mpi3mr: Do access status validation before adding devices
  scsi: mpi3mr: Add support for PCIe Managed Switch SES device
  ...
2022-01-14 14:37:34 +01:00
180dccb0db blk-mq: fix tag_get wait task can't be awakened
In case of shared tags, there might be more than one hctx which
allocates from the same tags, and each hctx is limited to allocate at
most:
        hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);

tag idle detection is lazy, and may be delayed for 30sec, so there
could be just one real active hctx(queue) but all others are actually
idle and still accounted as active because of the lazy idle detection.
Then if wake_batch is > hctx_max_depth, driver tag allocation may wait
forever on this real active hctx.

Fix this by recalculating wake_batch when inc or dec active_queues.

Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Suggested-by: Ming Lei <ming.lei@redhat.com>
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220113025536.1479653-1-qiulaibin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-13 12:52:14 -07:00
f079ab01b5 Merge tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux
Pull iomap updates from Matthew Wilcox:
 "Convert xfs/iomap to use folios.

  This should be all that is needed for XFS to use large folios. There
  is no code in this pull request to create large folios, but no
  additional changes should be needed to XFS or iomap once they are
  created.

  Usually this would have come from Darrick, and we had intended that it
  would come that route. Between the holidays and various things which
  Darrick needed to work on, he asked if I could send things directly.

  There weren't any other iomap patches pending for this release, which
  probably also played a role"

* tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux: (26 commits)
  iomap: Inline __iomap_zero_iter into its caller
  xfs: Support large folios
  iomap: Support large folios in invalidatepage
  iomap: Convert iomap_migrate_page() to use folios
  iomap: Convert iomap_add_to_ioend() to take a folio
  iomap: Simplify iomap_do_writepage()
  iomap: Simplify iomap_writepage_map()
  iomap,xfs: Convert ->discard_page to ->discard_folio
  iomap: Convert iomap_write_end_inline to take a folio
  iomap: Convert iomap_write_begin() and iomap_write_end() to folios
  iomap: Convert __iomap_zero_iter to use a folio
  iomap: Allow iomap_write_begin() to be called with the full length
  iomap: Convert iomap_page_mkwrite to use a folio
  iomap: Convert readahead and readpage to use a folio
  iomap: Convert iomap_read_inline_data to take a folio
  iomap: Use folio offsets instead of page offsets
  iomap: Convert bio completions to use folios
  iomap: Pass the iomap_page into iomap_set_range_uptodate
  iomap: Add iomap_invalidate_folio
  iomap: Convert iomap_releasepage to use a folio
  ...
2022-01-12 12:51:41 -08:00
d3c8108035 Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:

 - Unify where the struct request handling code is located in the blk-mq
   code (Christoph)

 - Header cleanups (Christoph)

 - Clean up the io_context handling code (Christoph, me)

 - Get rid of ->rq_disk in struct request (Christoph)

 - Error handling fix for add_disk() (Christoph)

 - request allocation cleanusp (Christoph)

 - Documentation updates (Eric, Matthew)

 - Remove trivial crypto unregister helper (Eric)

 - Reduce shared tag overhead (John)

 - Reduce poll_stats memory overhead (me)

 - Known indirect function call for dio (me)

 - Use atomic references for struct request (me)

 - Support request list issue for block and NVMe (me)

 - Improve queue dispatch pinning (Ming)

 - Improve the direct list issue code (Keith)

 - BFQ improvements (Jan)

 - Direct completion helper and use it in mmc block (Sebastian)

 - Use raw spinlock for the blktrace code (Wander)

 - fsync error handling fix (Ye)

 - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me)

* tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits)
  MAINTAINERS: add entries for block layer documentation
  docs: block: remove queue-sysfs.rst
  docs: sysfs-block: document virt_boundary_mask
  docs: sysfs-block: document stable_writes
  docs: sysfs-block: fill in missing documentation from queue-sysfs.rst
  docs: sysfs-block: add contact for nomerges
  docs: sysfs-block: sort alphabetically
  docs: sysfs-block: move to stable directory
  block: don't protect submit_bio_checks by q_usage_counter
  block: fix old-style declaration
  nvme-pci: fix queue_rqs list splitting
  block: introduce rq_list_move
  block: introduce rq_list_for_each_safe macro
  block: move rq_list macros to blk-mq.h
  block: drop needless assignment in set_task_ioprio()
  block: remove unnecessary trailing '\'
  bio.h: fix kernel-doc warnings
  block: check minor range in device_add_disk()
  block: use "unsigned long" for blk_validate_block_size().
  block: fix error unwinding in device_add_disk
  ...
2022-01-12 10:26:52 -08:00
9d497e2941 block: don't protect submit_bio_checks by q_usage_counter
Commit cc9c884dd7 ("block: call submit_bio_checks under q_usage_counter")
uses q_usage_counter to protect submit_bio_checks for avoiding IO after
disk is deleted by del_gendisk().

Turns out the protection isn't necessary, because once
blk_mq_freeze_queue_wait() in del_gendisk() returns:

1) all in-flight IO has been done

2) all new IO will be failed in __bio_queue_enter() because
   q_usage_counter is dead, and GD_DEAD is set

3) both disk and request queue instance are safe since caller of
submit_bio() guarantees that the disk can't be closed.

Once submit_bio_checks() needn't the protection of q_usage_counter, we can
move submit_bio_checks before calling blk_mq_submit_bio() and
->submit_bio(). With this change, we needn't to throttle queue with
holding one allocated request, then precise driver tag or request won't be
wasted in throttling. Meantime we can unify the bio check for both bio
based and request based driver.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220104134223.590803-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-09 18:54:52 -07:00
669a064625 block: drop needless assignment in set_task_ioprio()
Commit 5fc11eebb4 ("block: open code create_task_io_context in
set_task_ioprio") introduces a needless assignment
'ioc = task->io_context', as the local variable ioc is not further
used before returning.

Even after the further fix, commit a957b61254 ("block: fix error in
handling dead task for ioprio setting"), the assignment still remains
needless.

Drop this needless assignment in set_task_ioprio().

This code smell was identified with 'make clang-analyzer'.

Fixes: 5fc11eebb4 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211223125300.20691-1-lukas.bulwahn@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-23 07:10:07 -07:00
6e1fcab00a scsi: block: pm: Always set request queue runtime active in blk_post_runtime_resume()
John Garry reported a deadlock that occurs when trying to access a
runtime-suspended SATA device.  For obscure reasons, the rescan procedure
causes the link to be hard-reset, which disconnects the device.

The rescan tries to carry out a runtime resume when accessing the device.
scsi_rescan_device() holds the SCSI device lock and won't release it until
it can put commands onto the device's block queue.  This can't happen until
the queue is successfully runtime-resumed or the device is unregistered.
But the runtime resume fails because the device is disconnected, and
__scsi_remove_device() can't do the unregistration because it can't get the
device lock.

The best way to resolve this deadlock appears to be to allow the block
queue to start running again even after an unsuccessful runtime resume.
The idea is that the driver or the SCSI error handler will need to be able
to use the queue to resolve the runtime resume failure.

This patch removes the err argument to blk_post_runtime_resume() and makes
the routine act as though the resume was successful always.  This fixes the
deadlock.

Link: https://lore.kernel.org/r/1639999298-244569-4-git-send-email-chenxiang66@hisilicon.com
Fixes: e27829dc92 ("scsi: serialize ->rescan against ->remove")
Reported-and-tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-22 23:38:29 -05:00
e338924bd0 block: check minor range in device_add_disk()
ioctl(fd, LOOP_CTL_ADD, 1048576) causes

  sysfs: cannot create duplicate filename '/dev/block/7:0'

message because such request is treated as if ioctl(fd, LOOP_CTL_ADD, 0)
due to MINORMASK == 1048575. Verify that all minor numbers for that device
fit in the minor range.

Reported-by: wangyangbo <wangyangbo@uniontech.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/b1b19379-23ee-5379-0eb5-94bf5f79f1b4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-21 09:34:29 -07:00
99d8690aae block: fix error unwinding in device_add_disk
One device_add is called disk->ev will be freed by disk_release, so we
should free it twice.  Fix this by allocating disk->ev after device_add
so that the extra local unwinding can be removed entirely.

Based on an earlier patch from Tetsuo Handa.

Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211221161851.788424-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-21 09:31:51 -07:00
37e11c3616 block: call blk_exit_queue() before freeing q->stats
blk_stat_disable_accounting() is added in commit 68497092bd
("block: make queue stat accounting a reference"), and called in
kyber_exit_sched().

So we have to free q->stats after elevator is unloaded from
blk_exit_queue() in blk_release_queue(). Otherwise kernel panic
is caused.

Fixes: 68497092bd ("block: make queue stat accounting a reference")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211221040436.1333880-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-20 21:07:51 -07:00
a957b61254 block: fix error in handling dead task for ioprio setting
Don't combine the task exiting and "already have io_context" case, we
need to just abort if the task is marked as dead. Return -ESRCH, which
is the documented value for ioprio_set() if the specified task could not
be found.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: syzbot+8836466a79f4175961b0@syzkaller.appspotmail.com
Fixes: 5fc11eebb4 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-20 20:32:24 -07:00
518579a9af blk-mq: blk-mq: check quiesce state before queue_rqs
The low level drivers don't expect to see new requests after a
successful quiesce completes. Check the queue quiesce state within the
rcu protected area prior to calling the driver's queue_rqs().

Fixes: 3c67d44de7 ("block: add mq_ops->queue_rqs hook")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211220205919.180191-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-20 14:03:59 -07:00
2da09da4ae Merge tag 'block-5.16-2021-12-19' of git://git.kernel.dk/linux-block
Pull block revert from Jens Axboe:
 "It turns out that the fix for not hammering on the delayed work timer
  too much caused a performance regression for BFQ, so let's revert the
  change for now.

  I've got some ideas on how to fix it appropriately, but they should
  wait for 5.17"

* tag 'block-5.16-2021-12-19' of git://git.kernel.dk/linux-block:
  Revert "block: reduce kblockd_mod_delayed_work_on() CPU consumption"
2021-12-19 12:38:53 -08:00
87959fa16c Revert "block: reduce kblockd_mod_delayed_work_on() CPU consumption"
This reverts commit cb2ac2912a.

Alex and the kernel test robot report that this causes a significant
performance regression with BFQ. I can reproduce that result, so let's
revert this one as we're close to -rc6 and we there's no point in trying
to rush a fix.

Link: https://lore.kernel.org/linux-block/1639853092.524jxfaem2.none@localhost/
Link: https://lore.kernel.org/lkml/20211219141852.GH14057@xsang-OptiPlex-9020/
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-19 07:58:44 -07:00
fa09ca5ebc Merge tag 'block-5.16-2021-12-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - Fix for hammering on the delayed run queue timer (me)

 - bcache regression fix for this merge window (Lin)

 - Fix a divide-by-zero in the blk-iocost code (Tejun)

* tag 'block-5.16-2021-12-17' of git://git.kernel.dk/linux-block:
  bcache: fix NULL pointer reference in cached_dev_detach_finish
  block: reduce kblockd_mod_delayed_work_on() CPU consumption
  iocost: Fix divide-by-zero on donation from low hweight cgroup
2021-12-17 11:46:07 -08:00
85f5a74c2b block: Add bio_add_folio()
This is a thin wrapper around bio_add_page().  The main advantage here
is the documentation that folios larger than 2GiB are not supported.
It's not currently possible to allocate folios that large, but if it
ever becomes possible, this function will fail gracefully instead of
doing I/O to the wrong bytes.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-16 15:49:51 -05:00
5ef1630586 block: only build the icq tracking code when needed
Only bfq needs to code to track icq, so make it conditional.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
90b627f542 block: fold create_task_io_context into ioc_find_get_icq
Fold create_task_io_context into the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
5fc11eebb4 block: open code create_task_io_context in set_task_ioprio
The flow in set_task_ioprio can be simplified by simply open coding
create_task_io_context, which removes a refcount roundtrip on the I/O
context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
8472161b77 block: fold get_task_io_context into set_task_ioprio
Fold get_task_io_context into its only caller, and simplify the code
as no reference to the I/O context is required to just set the ioprio
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
a411cd3cfd block: move set_task_ioprio to blk-ioc.c
Keep set_task_ioprio with the other low-level code that accesses the
io_context structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
091abcb3ef block: cleanup ioc_clear_queue
Fold __ioc_clear_queue into ioc_clear_queue and switch to always
use plain _irq locking instead of the more expensive _irqsave that
is not needed here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
edf70ff5a1 block: refactor put_io_context
Move the code to delay freeing the icqs into a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
8a20c0c7e0 block: remove the NULL ioc check in put_io_context
No caller passes in a NULL pointer, so remove the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
4be8a2eaff block: refactor put_iocontext_active
Factor out a ioc_exit_icqs helper to tear down the icqs and the fold
the rest of put_iocontext_active into exit_io_context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
0aed2f162b block: simplify struct io_context refcounting
Don't hold a reference to ->refcount for each active reference, but
just one for all active references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
8a2ba1785c block: remove the nr_task field from struct io_context
Nothing ever looks at ->nr_tasks, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
3c67d44de7 block: add mq_ops->queue_rqs hook
If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this, if not the usual one-by-one path is used.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 08:49:17 -07:00
fcade2ce06 block: use singly linked list for bio cache
Pointless to maintain a head/tail for the list, as we never need to
access the tail. Entries are always LIFO for cache hotness reasons.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 08:43:09 -07:00
5581a5ddfe block: add completion handler for fast path
The batched completions only deal with non-partial requests anyway,
and it doesn't deal with any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 08:42:06 -07:00