Commit Graph

1169608 Commits

Author SHA1 Message Date
Eric Blake
daf376a366 uapi nbd: improve doc links to userspace spec
The uapi <linux/nbd.h> header intentionally documents only the NBD
server features that the kernel module will utilize as a client.  But
while it already had one mention of skipped bits due to userspace
extensions, it did not actually direct the reader to the canonical
source to learn about those extensions.

While touching comments, fix an outdated reference that listed only
READ and WRITE as commands.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230410180611.1051618-2-eblake@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-27 19:15:11 -06:00
Thomas Weißschuh
ff53cd52d9 blk-integrity: register sysfs attributes on struct device
The "integrity" kobject only acted as a holder for static sysfs entries.
It also was embedded into struct gendisk without managing it, violating
assumptions of the driver core.

Instead register the sysfs entries directly onto the struct device.

Also drop the now unused member integrity_kobj from struct gendisk.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-3-ceccb4493c46@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26 18:22:50 -06:00
Thomas Weißschuh
76b8c319f0 blk-integrity: convert to struct device_attribute
An upcoming patch will register the integrity attributes directly with
the struct device kobject.
For this the attributes have to be implemented in terms of
struct device_attribute.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-2-ceccb4493c46@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26 18:22:50 -06:00
Thomas Weißschuh
3315e169b4 blk-integrity: use sysfs_emit
The correct way to emit data into sysfs is via sysfs_emit(), use it.

Also perform some trivial syntactic cleanups.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-1-ceccb4493c46@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26 18:22:50 -06:00
Chaitanya Kulkarni
3f89ac587b block/drivers: remove dead clear of random flag
QUEUE_FLAG_ADD_RANDOM is not set before we clear it for "null_blk",
"brd", "nbd", "zram", and "bcache" since by default we don't set
"QUEUE_FLAG_ADD_RANDOM" to MQ ops.

Remove dead clear of QUEUE_FLAG_ADD_RANDOM in above listed drivers.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> #zram
Link: https://lore.kernel.org/r/20230424234628.45544-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-25 08:02:11 -06:00
Ming Lei
38c8e3dfb2 block: sync part's ->bd_has_submit_bio with disk's
submit_bio() always uses bio->bi_bdev->bd_has_submit_bio to decide if
disk's ->submit_bio() is called, and bio->bi_bdev could point to one
partition device.

So we have to sync part bdev's ->bd_has_submit_bio with disk's.

Reported-by: Changhui Zhong <czhong@redhat.com>
Link: https://lore.kernel.org/linux-block/ZEdItaPqif8fp85H@ovpn-8-24.pek2.redhat.com/T/#t
Fixes: 9f4107b07b ("block: store bdev->bd_disk->fops->submit_bio state in bdev")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230425034154.110099-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-25 07:36:02 -06:00
Damien Le Moal
83794367dc block: Cleanup set_capacity()/bdev_set_nr_sectors()
The code for setting a block device capacity (bd_nr_sectors field of
struct block_device) is duplicated in set_capacity() and
bdev_set_nr_sectors(). Clean this up by making bdev_set_nr_sectors()
a block layer internal function defined in block/bdev.c instead of
having this function statically defined in block/partitions/core.c.
With this change, set_capacity() implementation can be simplified to
only calling bdev_set_nr_sectors().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230424131318.79935-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-24 08:48:09 -06:00
Zhong Jinghua
55793ea54d nbd: fix incomplete validation of ioctl arg
Testing triggered a UBSAN warning caused by the nbd_ioctl() arg being used
without validation. The call trace is shown below:

UBSAN: Undefined behaviour in fs/buffer.c:1709:35
signed integer overflow:
-9223372036854775808 - 1 cannot be represented in type 'long long int'
CPU: 3 PID: 2523 Comm: syz-executor.0 Not tainted 4.19.90 #1
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x170/0x1dc lib/dump_stack.c:118
 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_sub_overflow+0x34/0x44 lib/ubsan.c:206
 __block_write_full_page+0x94c/0xa20 fs/buffer.c:1709
 block_write_full_page+0x1f0/0x280 fs/buffer.c:2934
 blkdev_writepage+0x34/0x40 fs/block_dev.c:607
 __writepage+0x68/0xe8 mm/page-writeback.c:2305
 write_cache_pages+0x44c/0xc70 mm/page-writeback.c:2240
 generic_writepages+0xdc/0x148 mm/page-writeback.c:2329
 blkdev_writepages+0x2c/0x38 fs/block_dev.c:2114
 do_writepages+0xd4/0x250 mm/page-writeback.c:2344

The warning is triggered because i_size_read(inode) - 1 overflows in
__block_write_full_page().
inode->i_size is assigned from bytesize in __nbd_ioctl() -> nbd_set_size().
The size of arg therefore needs to be limited to prevent such errors.

Moreover, in __nbd_ioctl() -> nbd_add_socket(), arg is cast to int.
If the value of arg is 0x8000000000000001 (on a 64-bit machine),
it becomes 1 after the conversion, which leads to unexpected results.

Fix it by adding checks to prevent passing in too large numbers.
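
The truncation described above can be modeled with a short sketch. It is
written in Python purely for illustration and is not the kernel code; the
helper names cast_to_int32() and validate_arg() are hypothetical, and a
32-bit two's-complement int is assumed.

```python
INT_MAX = 2**31 - 1  # largest value a 32-bit C int can hold (assumed int width)


def cast_to_int32(arg: int) -> int:
    """Model converting a 64-bit unsigned value to a 32-bit two's-complement int."""
    low = arg & 0xFFFFFFFF  # only the low 32 bits survive the conversion
    return low - 2**32 if low >= 2**31 else low


def validate_arg(arg: int) -> int:
    """Hypothetical check: reject values that would not survive the conversion."""
    if arg > INT_MAX:
        raise ValueError("arg too large")  # stands in for an -EINVAL style error
    return arg


# The value from the commit message silently becomes 1 without a check.
print(cast_to_int32(0x8000000000000001))
```

With a range check in front of the conversion, the oversized value is
rejected instead of being reinterpreted as a small, valid-looking number.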

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230206145805.2645671-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-20 13:43:44 -06:00
Ming Lei
7c75661c42 ublk: don't return 0 in case of any failure
Commit 2d786e66c9 ("block: ublk: switch to ioctl command encoding")
started resetting the local variable 'ret' to zero, so if any failure
happens when handling the three IO commands, 0 can be returned to the ublk
server.

Fix it by returning -EINVAL in case of command handling failure.

Cc: Christoph Hellwig <hch@lst.de>
Fixes: 2d786e66c9 ("block: ublk: switch to ioctl command encoding")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230420091104.1092972-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-20 07:03:02 -06:00
Ondrej Kozina
9e05a2599a sed-opal: geometry feature reporting command
Locking range start and locking range length
attributes may be required to satisfy restrictions
exposed by OPAL2 geometry feature reporting.

Geometry reporting feature is described in TCG OPAL SSC,
section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity
and LowestAlignedLBA).

4.3.5.2.1.1 RangeStart Behavior:

[ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ]

When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:

a) the AlignmentRequired (ALIGN above) column in the LockingInfo
   table is TRUE;
b) RangeStart is non-zero; and
c) StartAlignment is non-zero, then the method SHALL fail and
   return an error status code INVALID_PARAMETER.

4.3.5.2.1.2 RangeLength Behavior:

If RangeStart is zero, then
	[ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ]

If RangeStart is non-zero, then
	[ LengthAlignment = (RangeLength modulo AlignmentGranularity) ]

When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:

a) the AlignmentRequired (ALIGN above) column in the LockingInfo
   table is TRUE;
b) RangeLength is non-zero; and
c) LengthAlignment is non-zero, then the method SHALL fail and
   return an error status code INVALID_PARAMETER

In userspace we rely on the logical block size reported by the general
block device (via sysfs or ioctl), but 'AlignmentGranularity' and
'LowestAlignedLBA' cannot be read anywhere else. We need to get those
values from the sed-opal interface, otherwise we will not be able to
report or avoid the locking range setup INVALID_PARAMETER errors above.
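
A userspace tool could apply the quoted rules roughly as in the sketch
below. It is written in Python for illustration only; the function names
are hypothetical, the formulas are copied as quoted in this message, and
nothing here reflects the actual sed-opal ioctl layout.

```python
def start_alignment(range_start, granularity, lowest_aligned_lba):
    # StartAlignment as quoted from section 4.3.5.2.1.1
    return (range_start % granularity) - lowest_aligned_lba


def length_alignment(range_start, range_length, granularity, lowest_aligned_lba):
    # LengthAlignment as quoted from section 4.3.5.2.1.2
    remainder = range_length % granularity
    if range_start == 0:
        return remainder - lowest_aligned_lba
    return remainder


def range_is_accepted(align_required, range_start, range_length,
                      granularity, lowest_aligned_lba):
    """Return False where the quoted rules say INVALID_PARAMETER is returned."""
    if not align_required:
        return True
    if range_start != 0 and start_alignment(
            range_start, granularity, lowest_aligned_lba) != 0:
        return False
    if range_length != 0 and length_alignment(
            range_start, range_length, granularity, lowest_aligned_lba) != 0:
        return False
    return True
```

For example, with an assumed granularity of 8 blocks and LowestAlignedLBA
of 0, a range starting at block 16 with length 32 passes, while a range
starting at block 12 does not.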

Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20230411090931.9193-2-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-19 14:07:13 -06:00
Chaitanya Kulkarni
63f8793ee6 null_blk: Always check queue mode setting from configfs
Make sure to check the device queue mode in null_validate_conf() and
return an error for NULL_Q_RQ, as we don't allow the legacy I/O path.
Without this patch we get an Oops when the queue mode is set to 1 from
configfs. The repro steps are as follows:

modprobe null_blk nr_devices=0
mkdir config/nullb/nullb0
echo 1 > config/nullb/nullb0/memory_backed
echo 4096 > config/nullb/nullb0/blocksize
echo 20480 > config/nullb/nullb0/size
echo 1 > config/nullb/nullb0/queue_mode
echo 1 > config/nullb/nullb0/power

Entering kdb (current=0xffff88810acdd080, pid 2372) on processor 42 Oops: (null)
due to oops @ 0xffffffffc041c329
CPU: 42 PID: 2372 Comm: sh Tainted: G           O     N 6.3.0-rc5lblk+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:null_add_dev.part.0+0xd9/0x720 [null_blk]
Code: 01 00 00 85 d2 0f 85 a1 03 00 00 48 83 bb 08 01 00 00 00 0f 85 f7 03 00 00 80 bb 62 01 00 00 00 48 8b 75 20 0f 85 6d 02 00 00 <48> 89 6e 60 48 8b 75 20 bf 06 00 00 00 e8 f5 37 2c c1 48 8b 75 20
RSP: 0018:ffffc900052cbde0 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff88811084d800 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888100042e00
RBP: ffff8881053d8200 R08: ffffc900052cbd68 R09: ffff888105db2000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000002
R13: ffff888104765200 R14: ffff88810eec1748 R15: ffff88810eec1740
FS:  00007fd445fd1740(0000) GS:ffff8897dfc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000060 CR3: 0000000166a00000 CR4: 0000000000350ee0
DR0: ffffffff8437a488 DR1: ffffffff8437a489 DR2: ffffffff8437a48a
DR3: ffffffff8437a48b DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 nullb_device_power_store+0xd1/0x120 [null_blk]
 configfs_write_iter+0xb4/0x120
 vfs_write+0x2ba/0x3c0
 ksys_write+0x5f/0xe0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd4460c57a7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffd3792a4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd4460c57a7
RDX: 0000000000000002 RSI: 000055b43c02e4c0 RDI: 0000000000000001
RBP: 000055b43c02e4c0 R08: 000000000000000a R09: 00007fd44615b4e0
R10: 00007fd44615b3e0 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fd446198520 R14: 0000000000000002 R15: 00007fd446198700
 </TASK>

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Link: https://lore.kernel.org/r/20230416220339.43845-1-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18 20:15:35 -06:00
Ming Lei
2d786e66c9 block: ublk: switch to ioctl command encoding
All ublk commands (control, IO) should have used ioctl command encoding
from the beginning, because: 1) ioctl command encoding defines each code
uniquely, so the driver can easily detect a wrong command sent from
userspace; 2) it might help the security subsystem audit uring cmd[1].

Unfortunately we didn't do it that way, and it can serve as a lesson for
the ublk driver.

So switch to ioctl command encoding now. Commands encoded in the old way
are still supported, but they become legacy definitions. Any new command
should use ioctl encoding.
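
To illustrate why an ioctl-encoded command is easier to tell apart from a
bare opcode, the sketch below packs a command number the way the generic
Linux _IOC() macro does (direction, size, type and number fields). It is
Python for illustration only; the field widths follow the asm-generic
layout, which some architectures override, and the type character, opcode
and payload size used in the example are assumptions rather than values
taken from this commit.

```python
# Field shifts of the asm-generic _IOC() layout (assumed; some arches differ).
IOC_NRSHIFT, IOC_TYPESHIFT, IOC_SIZESHIFT, IOC_DIRSHIFT = 0, 8, 16, 30
IOC_WRITE, IOC_READ = 1, 2


def ioc(direction, type_char, nr, size):
    """Pack direction, type, number and payload size into one command word."""
    return ((direction << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT) |
            (ord(type_char) << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT))


def ioc_type(cmd):
    """Extract the 8-bit type field from a command word."""
    return (cmd >> IOC_TYPESHIFT) & 0xFF


legacy_cmd = 0x20                                   # hypothetical bare opcode
encoded_cmd = ioc(IOC_READ | IOC_WRITE, 'u', 0x20, 16)  # hypothetical values

print(hex(encoded_cmd))                   # 0xc0107520
print(ioc_type(legacy_cmd) == ord('u'))   # False: a bare opcode has no type field
```

A driver can therefore reject a command whose type field does not match,
while still recognising the bare opcodes as the legacy form.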

See ublksrv code for switching to ioctl command encoding in [2].

[1] https://lore.kernel.org/io-uring/CAHC9VhSVzujW9LOj5Km80AjU0EfAuukoLrxO6BEfnXeK_s6bAg@mail.gmail.com/
[2] https://github.com/ming1/ubdsrv/commits/ioctl_cmd_encoding

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ken Kurematsu <k.kurematsu@nskint.co.jp>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230418131810.855959-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18 20:13:30 -06:00
Christoph Hellwig
26a42b614e blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush
Commit b12e5c6c75 accidentally changes blk_kick_flush to do a head
insert into the requeue list, fix this up.

Fixes: b12e5c6c75 ("blk-mq: pass a flags argument to blk_mq_add_to_requeue_list")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230416073553.966161-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16 13:01:43 -06:00
Colin Ian King
e53413f8de block, bfq: Fix division by zero error on zero wsum
When the weighted sum is zero the calculation of limit causes
a division by zero error. Fix this by continuing to the next level.
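
The guard can be pictured with the following sketch. It is a simplified
Python model, not the BFQ code: the per-level (weight, wsum) pairs and the
rounded division are assumptions made only to show the "skip the level
when wsum is zero" idea.

```python
def scaled_limit(limit, levels):
    """Scale limit by weight/wsum at each level, skipping levels with wsum == 0.

    levels is an iterable of (weight, wsum) pairs; this shape is hypothetical.
    """
    for weight, wsum in levels:
        if wsum == 0:
            continue  # would otherwise divide by zero; move on to the next level
        # Division rounded to the closest integer for non-negative operands.
        limit = (limit * weight + wsum // 2) // wsum
    return limit
```

For instance, scaled_limit(64, [(100, 200), (50, 0), (100, 400)]) skips the
middle level and returns 8 instead of raising a division error.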

This was discovered by running as root:

stress-ng --ioprio 0

Fixes division by zero error oops:

[  521.450556] divide error: 0000 [#1] SMP NOPTI
[  521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1
[  521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[  521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400
[  521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44
[  521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046
[  521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
[  521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978
[  521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0
[  521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18
[  521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970
[  521.454549] FS:  00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000
[  521.454912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0
[  521.455491] PKRU: 55555554
[  521.455619] Call Trace:
[  521.455736]  <TASK>
[  521.455837]  ? bfq_request_merge+0x3a/0xc0
[  521.456027]  ? elv_merge+0x115/0x140
[  521.456191]  bfq_limit_depth+0xc8/0x240
[  521.456366]  __blk_mq_alloc_requests+0x21a/0x2c0
[  521.456577]  blk_mq_submit_bio+0x23c/0x6c0
[  521.456766]  __submit_bio+0xb8/0x140
[  521.457236]  submit_bio_noacct_nocheck+0x212/0x300
[  521.457748]  submit_bio_noacct+0x1a6/0x580
[  521.458220]  submit_bio+0x43/0x80
[  521.458660]  ext4_io_submit+0x23/0x80
[  521.459116]  ext4_do_writepages+0x40a/0xd00
[  521.459596]  ext4_writepages+0x65/0x100
[  521.460050]  do_writepages+0xb7/0x1c0
[  521.460492]  __filemap_fdatawrite_range+0xa6/0x100
[  521.460979]  file_write_and_wait_range+0xbf/0x140
[  521.461452]  ext4_sync_file+0x105/0x340
[  521.461882]  __x64_sys_fsync+0x67/0x100
[  521.462305]  ? syscall_exit_to_user_mode+0x2c/0x1c0
[  521.462768]  do_syscall_64+0x3b/0xc0
[  521.463165]  entry_SYSCALL_64_after_hwframe+0x5a/0xc4
[  521.463621] RIP: 0033:0x5640b6c56590
[  521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16 13:01:43 -06:00
Akinobu Mita
d325c16263 fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m
This fixes a build error when CONFIG_FAULT_INJECTION_CONFIGFS=y and
CONFIG_CONFIGFS_FS=m.

Since the fault-injection library cannot be built as a module, avoid building
configfs as a module.

Fixes: 4668c7a294 ("fault-inject: allow configuration via configfs")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202304150025.K0hczLR4-lkp@intel.com/
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16 13:01:42 -06:00
Jens Axboe
9f4107b07b block: store bdev->bd_disk->fops->submit_bio state in bdev
We have a long chain of memory dereferencing just to check whether or not
this disk has a special submit_bio helper. As that's not necessarily
the common case, add a bd_has_submit_bio state in the bdev to avoid
traversing this memory dependency chain if we don't need to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16 13:01:42 -06:00
Jens Axboe
3838c406a5 block: re-arrange the struct block_device fields for better layout
This moves struct device out-of-line as it's just used at open/close
time, so we can keep some of the commonly used fields closer together.
On a standard setup, it also reduces the size from 864 bytes to 848
bytes. Yes, struct device is a pig...

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16 13:01:35 -06:00
Jens Axboe
310e9c85c6 Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.4/block
Pull MD updates from Song:

"- md/bitmap: Optimal last page size, by Jon Derrick
 - Various raid10 fixes, by Yu Kuai and Li Nan
 - md: add error_handlers for raid0 and linear, by Mariusz Tkaczyk"

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid5: remove unused working_disks variable
  md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
  md/raid10: fix memleak of md thread
  md/raid10: fix memleak for 'conf->bio_split'
  md/raid10: fix leak of 'r10bio->remaining' for recovery
  md/raid10: don't BUG_ON() in raise_barrier()
  md: fix soft lockup in status_resync
  md: add error_handlers for raid0 and linear
  md: Use optimal I/O size for last bitmap page
  md: Fix types in sb writer
  md: Move sb writer loop to its own function
  md/raid10: Fix typo in comment (replacment -> replacement)
  md: make kobj_type structures constant
  md/raid10: fix null-ptr-deref in raid10_sync_request
  md/raid10: fix task hung in raid10d
2023-04-14 06:32:44 -06:00
Jens Axboe
d2a1d45ced nvme updates for Linux 6.4

Merge tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme into for-6.4/block

Pull NVMe updates from Christoph:

"nvme updates for Linux 6.4

 - drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
 - validate nvmet module parameters (Chaitanya Kulkarni)
 - fence TCP socket on receive error (Chris Leech)
 - fix async event trace event (Keith Busch)
 - minor cleanups (Chaitanya Kulkarni, zhenwei pi)
 - fix and cleanup nvmet Identify handling (Damien Le Moal,
   Christoph Hellwig)
 - fix double blk_mq_complete_request race in the timeout handler
   (Lei Yin)
 - fix irq locking in nvme-fcloop (Ming Lei)
 - remove queue mapping helper for rdma devices (Sagi Grimberg)"

* tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme:
  nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage"
  blk-mq-rdma: remove queue mapping helper for rdma devices
  nvme-rdma: minor cleanup in nvme_rdma_create_cq()
  nvme: fix double blk_mq_complete_request for timeout request with low probability
  nvme: fix async event trace event
  nvme-apple: return directly instead of else
  nvme-apple: return directly instead of else
  nvmet-tcp: validate idle poll modparam value
  nvmet-tcp: validate so_priority modparam value
  nvme-tcp: fence TCP socket on receive error
  nvmet: remove nvmet_req_cns_error_complete
  nvmet: rename nvmet_execute_identify_cns_cs_ns
  nvmet: fix Identify Identification Descriptor List handling
  nvmet: cleanup nvmet_execute_identify()
  nvmet: fix I/O Command Set specific Identify Controller
  nvmet: fix Identify Active Namespace ID list handling
  nvmet: fix Identify Controller handling
  nvmet: fix Identify Namespace handling
  nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns()
  nvme-pci: drop redundant pci_enable_pcie_error_reporting()
2023-04-14 06:31:29 -06:00
Tom Rix
7bc436121e md/raid5: remove unused working_disks variable
clang with W=1 reports
drivers/md/raid5.c:7719:6: error: variable 'working_disks'
  set but not used [-Werror,-Wunused-but-set-variable]
        int working_disks = 0;
            ^
This variable is not used so remove it.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230327132324.1769595-1-trix@redhat.com
2023-04-14 00:42:04 -07:00
Yu Kuai
7cddb055bf md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
handle_read_error() will resubmit r10_bio via raid10_read_request(), which
will call bio_start_io_acct() again, while bio_end_io_acct() will only
be called once.

Fix the problem by not accounting the io again from handle_read_error().

Fixes: 528bc2cf2f ("md/raid10: enable io accounting")
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230314012258.2395894-1-yukuai1@huaweicloud.com
2023-04-14 00:42:03 -07:00
Yu Kuai
f0ddb83da3 md/raid10: fix memleak of md thread
In raid10_run(), if setup_conf() succeeds and raid10_run() fails before
setting 'mddev->thread', then in the error path 'conf->thread' is not
freed.

Fix the problem by setting 'mddev->thread' right after setup_conf().

Fixes: 43a521238a ("md-cluster: choose correct label when clustered layout is not supported")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-7-yukuai1@huaweicloud.com
2023-04-14 00:42:03 -07:00
Yu Kuai
c9ac2acde5 md/raid10: fix memleak for 'conf->bio_split'
In the error path of raid10_run(), 'conf' needs to be freed; however,
'conf->bio_split' is missed and its memory is leaked.

Since there are 3 places to free 'conf', factor out a helper to fix the
problem.

Fixes: fc9977dd06 ("md/raid10: simplify the splitting of requests.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-6-yukuai1@huaweicloud.com
2023-04-14 00:42:03 -07:00
Yu Kuai
26208a7cff md/raid10: fix leak of 'r10bio->remaining' for recovery
raid10_sync_request() will increment 'r10bio->remaining' for both rdev and
replacement rdev. However, if the read io fails, recovery_request_write()
returns without issuing the write io. In this case, end_sync_request()
is only called once and 'remaining' is leaked, causing an io hang.

Fix the problem by decreasing 'remaining' according to whether 'bio' and
'repl_bio' are valid.

Fixes: 24afd80d99 ("md/raid10: handle recovery of replacement devices.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-5-yukuai1@huaweicloud.com
2023-04-14 00:41:40 -07:00
Yu Kuai
9fdfe6d45b md/raid10: don't BUG_ON() in raise_barrier()
If raise_barrier() is called the first time in raid10_sync_request(), which
means the first non-normal io is handled, raise_barrier() should wait for
all dispatched normal io to be done. This ensures that normal io won't
starve.

However, calling BUG_ON() if this is broken is too aggressive. This patch
replaces BUG_ON() with WARN and falls back to the non-force mode.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-4-yukuai1@huaweicloud.com
2023-04-13 22:20:24 -07:00
Yu Kuai
6efddf1e32 md: fix soft lockup in status_resync
status_resync() will calculate 'curr_resync - recovery_active' to show
the user a progress bar like the following:

[============>........]  resync = 61.4%

'curr_resync' and 'recovery_active' are updated in md_do_sync(), and
status_resync() can read them concurrently, hence it's possible that
'curr_resync - recovery_active' overflows to a huge number. In this
case status_resync() will be stuck in the loop printing a large number
of '=', which will end up in a soft lockup.

Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case,
this way resync in progress will be reported to user.
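
The wrap-around and the clamp can be modeled as below. This is a Python
sketch for illustration, assuming 64-bit unsigned arithmetic; the helper
names are hypothetical and the numeric value used for MD_RESYNC_ACTIVE is
only a placeholder, not taken from this commit.

```python
U64_MASK = 2**64 - 1
MD_RESYNC_ACTIVE = 3  # placeholder value, assumed for illustration


def wrapped_progress(curr_resync, recovery_active):
    """Unsigned 64-bit subtraction as it behaves when the operands race."""
    return (curr_resync - recovery_active) & U64_MASK


def clamped_progress(curr_resync, recovery_active):
    """Report 'resync active' instead of a wrapped, huge progress value."""
    if curr_resync < recovery_active:
        return MD_RESYNC_ACTIVE
    return curr_resync - recovery_active


print(wrapped_progress(10, 20))  # a huge number close to 2**64
print(clamped_progress(10, 20))  # the placeholder 'active' marker
```

A progress bar sized from the wrapped value would need an enormous number
of '=' characters, which is the loop that leads to the soft lockup.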

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
2023-04-13 22:20:24 -07:00
Mariusz Tkaczyk
c31fea2f8e md: add error_handlers for raid0 and linear
After commit 9631abdbf406c ("md: Set MD_BROKEN for RAID1 and RAID10"),
MD_BROKEN must be set if the array has failed because state_store() checks it.
If it is set then -EBUSY is returned to userspace.

For raid0 and linear MD_BROKEN is not set by error_handler(). As a result
mdadm is unable to trigger clean-up actions. It is a regression.

This patch adds appropriate error_handler for raid0 and linear. The
error handler sets MD_BROKEN for this device.

Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
2023-04-13 22:20:24 -07:00
Jon Derrick
8745faa956 md: Use optimal I/O size for last bitmap page
If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device. The expanded
write is checked to ensure it won't overrun the data or metadata.

The drive this was tested against has higher latencies when there are
sub-4k writes due to device-side read-mod-writes of its atomic 4k write
unit. This change helps increase performance by sizing the last bitmap
page I/O for the device's preferred write unit, if it is given.
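
The sizing rule can be sketched as follows. This is an illustrative Python
model with hypothetical names, not the md bitmap code; the 3584-byte and
4096-byte figures match the write sizes visible in the biosnoop traces in
this message, while the amount of available room is an assumption.

```python
def round_up(value, multiple):
    """Round value up to the next multiple (both positive integers)."""
    return -(-value // multiple) * multiple


def last_page_io_size(last_page_bytes, optimal_io_bytes, room_bytes):
    """Expand the last bitmap page write to the optimal I/O size if it fits."""
    if optimal_io_bytes <= 0:
        return last_page_bytes  # device gave no preferred write unit
    expanded = round_up(last_page_bytes, optimal_io_bytes)
    if expanded > room_bytes:
        return last_page_bytes  # would overrun data or metadata, keep as is
    return expanded


print(last_page_io_size(3584, 4096, 8192))  # 4096
```

When the expanded write does not fit, or the device reports no preferred
unit, the original size is kept.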

Example Intel/Solidigm P5520
Raid10, Chunk-size 64M, bitmap-size 57228 bits

$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1
        --assume-clean --bitmap=internal --bitmap-chunk=64M
$ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60

Without patch:
  write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets

With patch:
  write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets

Biosnoop:
Without patch:
Time        Process        PID     Device      LBA        Size      Lat
1.410377    md0_raid10     6900    nvme0n1   W 16         4096      0.02
1.410387    md0_raid10     6900    nvme2n1   W 16         4096      0.02
1.410374    md0_raid10     6900    nvme3n1   W 16         4096      0.01
1.410381    md0_raid10     6900    nvme1n1   W 16         4096      0.02
1.410411    md0_raid10     6900    nvme1n1   W 115346512  4096      0.01
1.410418    md0_raid10     6900    nvme0n1   W 115346512  4096      0.02
1.410915    md0_raid10     6900    nvme2n1   W 24         3584      0.43 <--
1.410935    md0_raid10     6900    nvme3n1   W 24         3584      0.45 <--
1.411124    md0_raid10     6900    nvme1n1   W 24         3584      0.64 <--
1.411147    md0_raid10     6900    nvme0n1   W 24         3584      0.66 <--
1.411176    md0_raid10     6900    nvme3n1   W 2019022184 4096      0.01
1.411189    md0_raid10     6900    nvme2n1   W 2019022184 4096      0.02

With patch:
Time        Process        PID     Device      LBA        Size      Lat
5.747193    md0_raid10     727     nvme0n1   W 16         4096      0.01
5.747192    md0_raid10     727     nvme1n1   W 16         4096      0.02
5.747195    md0_raid10     727     nvme3n1   W 16         4096      0.01
5.747202    md0_raid10     727     nvme2n1   W 16         4096      0.02
5.747229    md0_raid10     727     nvme3n1   W 1196223704 4096      0.02
5.747224    md0_raid10     727     nvme0n1   W 1196223704 4096      0.01
5.747279    md0_raid10     727     nvme0n1   W 24         4096      0.01 <--
5.747279    md0_raid10     727     nvme1n1   W 24         4096      0.02 <--
5.747284    md0_raid10     727     nvme3n1   W 24         4096      0.02 <--
5.747291    md0_raid10     727     nvme2n1   W 24         4096      0.02 <--
5.747314    md0_raid10     727     nvme2n1   W 2234636712 4096      0.01
5.747317    md0_raid10     727     nvme1n1   W 2234636712 4096      0.02

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-4-jonathan.derrick@linux.dev
2023-04-13 22:20:24 -07:00
Jon Derrick
10172f200b md: Fix types in sb writer
Page->index is a pgoff_t and multiplying could cause overflows on a
32-bit architecture. In the sb writer, this is used to calculate and
verify the sector being used, and is multiplied by a sector value. Using
sector_t will cast it to a u64 type and is the more appropriate type for
the unit. Additionally, the integer size unit is converted to a sector
unit in later calculations, and is now corrected to be an unsigned type.
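
The overflow risk can be shown with a small model. It is Python for
illustration only, masks emulate fixed-width unsigned multiplication, and
the figure of 8 sectors per page is an assumption (4 KiB pages, 512-byte
sectors).

```python
def mul_u32(a, b):
    """Multiply as a 32-bit unsigned type would, discarding high bits."""
    return (a * b) & 0xFFFFFFFF


def mul_u64(a, b):
    """Multiply as a 64-bit unsigned type would."""
    return (a * b) & 0xFFFFFFFFFFFFFFFF


SECTORS_PER_PAGE = 8   # assumed: 4096-byte page / 512-byte sector
page_index = 2**29     # hypothetical large page index

print(mul_u32(page_index, SECTORS_PER_PAGE))  # 0: the result wrapped
print(mul_u64(page_index, SECTORS_PER_PAGE))  # 4294967296
```

Widening the operand before multiplying keeps the sector value intact,
which is the effect of using a 64-bit sector type.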

Finally, clean up the calculations using variable aliases to improve
readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-3-jonathan.derrick@linux.dev
2023-04-13 22:20:24 -07:00
Jon Derrick
328e17d8d9 md: Move sb writer loop to its own function
Preparatory patch for optimal I/O size calculation. Move the sb writer
loop routine into its own function for clarity.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-2-jonathan.derrick@linux.dev
2023-04-13 22:20:23 -07:00
Jiangshan Yi
dccb8ad615 md/raid10: Fix typo in comment (replacment -> replacement)
Replace replacment with replacement.

Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230214064013.2373851-1-yijiangshan@kylinos.cn
2023-04-13 22:20:23 -07:00
Thomas Weißschuh
4d72a9de2f md: make kobj_type structures constant
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230214-kobj_type-md-v1-1-d6853f707f11@weissschuh.net
2023-04-13 22:20:23 -07:00
Li Nan
a405c6f022 md/raid10: fix null-ptr-deref in raid10_sync_request
init_resync() initializes the mempool and sets conf->have_replacement at the
beginning of sync; close_sync() frees the mempool when sync completes.

After [1], recovery might be skipped, in which case init_resync() is called
but close_sync() is not. A null-ptr-deref then occurs on
r10bio->dev[i].repl_bio.

The following is one way to reproduce the issue.

  1) create an array and wait for resync to complete; mddev->recovery_cp is
     set to MaxSector.
  2) recovery is woken and skipped. conf->have_replacement is set to 0 in
     init_resync(), and close_sync() is not called.
  3) some io errors occur and rdev A is set to WantReplacement.
  4) a new device is added and set as A's replacement.
  5) recovery is woken. A has a replacement, but conf->have_replacement is
     0, so r10bio->dev[i].repl_bio is not allocated and the null-ptr-deref
     occurs.

Fix it by not calling init_resync() if recovery is skipped.

[1] commit 7e83ccbecd ("md/raid10: Allow skipping recovery when clean arrays are assembled")
Fixes: 7e83ccbecd ("md/raid10: Allow skipping recovery when clean arrays are assembled")
Cc: stable@vger.kernel.org
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230222041000.3341651-3-linan666@huaweicloud.com
2023-04-13 22:20:23 -07:00
Li Nan
72c215ed87 md/raid10: fix task hung in raid10d
Commit fe630de009 ("md/raid10: avoid deadlock on recovery.") allowed
normal io and sync io to exist at the same time. A task hang can then occur
as shown below:

T1                      T2		T3		T4
raid10d
 handle_read_error
  allow_barrier
   conf->nr_pending--
    -> 0
                        //submit sync io
                        raid10_sync_request
                         raise_barrier
			  ->will not be blocked
			  ...
			//submit to drivers
  raid10_read_request
   wait_barrier
    conf->nr_pending++
     -> 1
					//retry read fail
					raid10_end_read_request
					 reschedule_retry
					  add to retry_list
					  conf->nr_queued++
					   -> 1
							//sync io fail
							end_sync_read
							 __end_sync_read
							  reschedule_retry
							   add to retry_list
					                    conf->nr_queued++
							     -> 2
 ...
 handle_read_error
 get from retry_list
 conf->nr_queued--
  freeze_array
   wait nr_pending == nr_queued+1
        ->1	      ->2
   //task hung

The retried read and the sync io are both added to retry_list
(nr_queued -> 2) if they fail. raid10d() then calls handle_read_error() and
hangs in freeze_array(): nr_queued will not decrease because raid10d is
blocked, and nr_pending will not increase because conf->barrier is not
released.
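
The counter arithmetic in the diagram can be replayed with a small user-space
model. This is a sketch only: struct r10_counters, freeze_condition_met() and
replay_hang() are hypothetical names, the starting value nr_pending == 1 is
inferred from the "-> 0" step in the diagram, and the real freeze_array()
sleeps on a wait queue rather than evaluating the condition once.

```c
/* Hypothetical model of the two counters shown in the diagram. */
struct r10_counters {
	int nr_pending;
	int nr_queued;
};

/* The condition freeze_array() waits for, per the diagram:
 * nr_pending == nr_queued + extra. */
static int freeze_condition_met(const struct r10_counters *c, int extra)
{
	return c->nr_pending == c->nr_queued + extra;
}

/* Replay the interleaving T1..T4 and return the final counter values. */
static struct r10_counters replay_hang(void)
{
	/* Assumed start: the failed read is still counted as pending. */
	struct r10_counters c = { .nr_pending = 1, .nr_queued = 0 };

	c.nr_pending--;	/* T1: allow_barrier in handle_read_error -> 0 */
	/* T2: raise_barrier is not blocked, sync io is submitted */
	c.nr_pending++;	/* T1: wait_barrier in raid10_read_request -> 1 */
	c.nr_queued++;	/* T3: retried read fails, queued for retry -> 1 */
	c.nr_queued++;	/* T4: sync io fails, queued for retry -> 2 */
	c.nr_queued--;	/* T1: next handle_read_error takes one entry -> 1 */
	return c;
}
```

With extra == 1 the model ends at nr_pending == 1 and nr_queued + 1 == 2, so
the condition is false and nothing that is still able to run can change
either counter.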

Fix it by moving allow_barrier() after raid10_read_request().
raise_barrier() will wait for nr_waiting to become 0. Therefore, sync io
and regular io will not be issued at the same time.

Also remove the nr_queued check in stop_waiting_barrier: nr_queued can be 0
without any need to block. Remove the check for MD_RECOVERY_RUNNING as
well, since it is redundant.

Fixes: fe630de009 ("md/raid10: avoid deadlock on recovery.")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230222041000.3341651-2-linan666@huaweicloud.com
2023-04-13 22:20:23 -07:00
Akinobu Mita
bb4c19e030 block: null_blk: make fault-injection dynamically configurable per device
The null_blk driver has multiple driver-specific fault injection
mechanisms.  Each fault injection configuration can only be specified by a
module parameter and cannot be reconfigured without reloading the driver.
Also, each configuration is common to all devices and is initialized every
time a new device is added.

This change adds the following subdirectories for each null_blk device.

/sys/kernel/config/nullb/<disk>/timeout_inject
/sys/kernel/config/nullb/<disk>/requeue_inject
/sys/kernel/config/nullb/<disk>/init_hctx_fault_inject

Each fault injection attribute can be dynamically set per device by a
corresponding file in these directories.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://lore.kernel.org/r/20230327143733.14599-3-akinobu.mita@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 07:38:55 -06:00
Akinobu Mita
4668c7a294 fault-inject: allow configuration via configfs
This provides a helper function to allow configuration of fault-injection
for configfs-based drivers.

The config items created by this function have the same interface as the
one created under debugfs by fault_create_debugfs_attr().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://lore.kernel.org/r/20230327143733.14599-2-akinobu.mita@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 07:38:54 -06:00
Christoph Hellwig
4d5bba5bee blk-mq: remove __blk_mq_run_hw_queue
__blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from
interrupt context and a blk_mq_run_dispatch_ops-protected call to
blk_mq_sched_dispatch_requests.  Open code the call to
blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE
to blk_mq_run_hw_queue where it can be extended to all !async calls,
while the other call is from workqueue context and thus obviously does
not need the assert.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:58:02 -06:00
Christoph Hellwig
1aa8d875b5 blk-mq: move the !async handling out of __blk_mq_delay_run_hw_queue
Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with
async=false, so move the handling there.

With this __blk_mq_delay_run_hw_queue can be merged into
blk_mq_delay_run_hw_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:57:18 -06:00
Christoph Hellwig
cd735e1113 blk-mq: move the blk_mq_hctx_stopped check in __blk_mq_delay_run_hw_queue
For the in-context dispatch, blk_mq_hctx_stopped is already checked in
blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection.
For the async dispatch case having a check before scheduling the work
still makes sense to avoid needless workqueue scheduling, so just keep it
for that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:57:18 -06:00
Christoph Hellwig
c20a1a2c1a blk-mq: remove the blk_mq_hctx_stopped check in blk_mq_run_work_fn
blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests
under blk_mq_run_dispatch_ops() protection, so remove the duplicate check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:57:18 -06:00
Christoph Hellwig
89ea5ceb53 blk-mq: cleanup __blk_mq_sched_dispatch_requests
__blk_mq_sched_dispatch_requests currently has duplicated logic
for the cases where requests are on the hctx dispatch list or not.
Merge the two with a new need_dispatch variable and remove a few
pointless local variables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:57:18 -06:00
Christoph Hellwig
b12e5c6c75 blk-mq: pass a flags argument to blk_mq_add_to_requeue_list
Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
93fffe16f7 blk-mq: pass a flags argument to elevator_type->insert_requests
Instead of passing a bool at_head, pass down the full flags from the
blk_mq_insert_request interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
2b5976134b blk-mq: pass a flags argument to blk_mq_request_bypass_insert
Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
710fa3789e blk-mq: pass a flags argument to blk_mq_insert_request
Replace the at_head bool with a flags argument that so far only contains
a single BLK_MQ_INSERT_AT_HEAD value.  This makes it much easier to grep
for head insertions into the blk-mq dispatch queues.
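
The bool-to-flags conversion can be sketched outside the kernel as follows.
Only the BLK_MQ_INSERT_AT_HEAD name comes from this commit; insert_flags_t,
struct node, struct queue, queue_insert() and the 0x01 value are assumptions
made for illustration and do not reproduce the blk-mq code.

```c
#include <stddef.h>

/* Hypothetical flag type; a named bit replaces an anonymous true/false. */
typedef unsigned int insert_flags_t;
#define BLK_MQ_INSERT_AT_HEAD 0x01u

struct node {
	int id;
	struct node *next;
};

struct queue {
	struct node *head;
	struct node *tail;
};

/* Insert at the tail by default, or at the head when the flag is set. */
static void queue_insert(struct queue *q, struct node *n, insert_flags_t flags)
{
	n->next = NULL;
	if (flags & BLK_MQ_INSERT_AT_HEAD) {
		n->next = q->head;
		q->head = n;
		if (!q->tail)
			q->tail = n;
	} else {
		if (q->tail)
			q->tail->next = n;
		else
			q->head = n;
		q->tail = n;
	}
}
```

A call site then reads queue_insert(&q, &n, BLK_MQ_INSERT_AT_HEAD) instead of
queue_insert(&q, &n, true), so head insertions can be found by searching for
the flag name.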

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
214a441805 blk-mq: don't kick the requeue_list in blk_mq_add_to_requeue_list
blk_mq_add_to_requeue_list takes a bool parameter to control how to kick
the requeue list at the end of the function.  Move the call to
blk_mq_kick_requeue_list to the callers that want it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
2394395cd5 blk-mq: don't run the hw_queue from blk_mq_request_bypass_insert
blk_mq_request_bypass_insert takes a bool parameter to control how to run
the queue at the end of the function.  Move the blk_mq_run_hw_queue call
to the callers that want it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
f0dbe6e88e blk-mq: don't run the hw_queue from blk_mq_insert_request
blk_mq_insert_request takes two bool parameters to control how to run
the queue at the end of the function.  Move the blk_mq_run_hw_queue call
to the callers that want it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
e1f44ac0d7 blk-mq: fold __blk_mq_try_issue_directly into its two callers
Due to the wildly different behavior based on the bypass_insert argument,
not a whole lot of code in __blk_mq_try_issue_directly is actually shared
between blk_mq_try_issue_directly and blk_mq_request_issue_directly.

Remove __blk_mq_try_issue_directly and fold the code into the two callers
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00
Christoph Hellwig
2b71b87707 blk-mq: factor out a blk_mq_get_budget_and_tag helper
Factor out a helper from __blk_mq_try_issue_directly in preparation
of folding that function into its two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13 06:52:30 -06:00