IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Add a mark_dead method to blk_holder_ops that is called from blk_mark_disk_dead
to notify the holder that the block device it is using has been marked dead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims. It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is only a single caller left, so fold the loop into that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Delay dropping the block_devices for partitions in del_gendisk until
after the call to blk_mark_disk_dead, so that we can implementat
notification of removed devices in blk_mark_disk_dead.
This requires splitting a lower-level drop_partition helper out of
delete_partition and using that from del_gendisk, while having a
common loop for the whole device and partitions that calls
remove_inode_hash, fsync_bdev and __invalidate_device before the
call to blk_mark_disk_dead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the call to remove_inode_hash to the beginning of delete_partition,
as we want to prevent opening a block_device that is about to be removed
ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check if GD_DEAD is already set in blk_mark_disk_dead, and don't
duplicate the work already done.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mark_disk_dead does very similar work a a section of del_gendisk:
- set the GD_DEAD flag
- set the capacity to zero
- start a queue drain
but del_gendisk also sets QUEUE_FLAG_DYING on the queue if it is owned by
the disk, sets the capacity to zero before starting the drain, and both
with sending a uevent and kernel message for this fake capacity change.
Move the exact logic from the more heavily used del_gendisk into
blk_mark_disk_dead and then call blk_mark_disk_dead from del_gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no reason for this lock to spin, and being able to sleep under
it will come in handy soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The long if/else chain obsfucates the actual logic. Tidy it up to be
more structured. Also drop the whole argument, as it can be trivially
derived from bdev using bdev_whole, and having the bdev_whole in the
function makes it easier to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move all the logic to release an exclusive claim into a helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nr_active counter continues to increase over time which causes the
blk_mq_get_tag to hang until the thread is rescheduled to a different
core despite there are still tags available.
kernel-stack
INFO: task inboundIOReacto:3014879 blocked for more than 2 seconds
Not tainted 6.1.15-amd64 #1 Debian 6.1.15~debian11
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:inboundIOReacto state:D stack:0 pid:3014879 ppid:4557 flags:0x00000000
Call Trace:
<TASK>
__schedule+0x351/0xa20
scheduler+0x5d/0xe0
io_schedule+0x42/0x70
blk_mq_get_tag+0x11a/0x2a0
? dequeue_task_stop+0x70/0x70
__blk_mq_alloc_requests+0x191/0x2e0
kprobe output showing RQF_MQ_INFLIGHT bit is not cleared before
__blk_mq_free_request being called.
320 320 kworker/29:1H __blk_mq_free_request rq_flags 0x220c0 in-flight 1
b'__blk_mq_free_request+0x1 [kernel]'
b'bt_iter+0x50 [kernel]'
b'blk_mq_queue_tag_busy_iter+0x318 [kernel]'
b'blk_mq_timeout_work+0x7c [kernel]'
b'process_one_work+0x1c4 [kernel]'
b'worker_thread+0x4d [kernel]'
b'kthread+0xe6 [kernel]'
b'ret_from_fork+0x1f [kernel]'
Signed-off-by: Tian Lan <tian.lan@twosigma.com>
Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230513221227.497327-1-tilan7663@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, icq is tracked by both request_queue(icq->q_node) and
task(icq->ioc_node), and ioc_clear_queue() from elevator exit is not
safe because it can access the list without protection:
ioc_clear_queue ioc_release_fn
lock queue_lock
list_splice
/* move queue list to a local list */
unlock queue_lock
/*
* lock is released, the local list
* can be accessed through task exit.
*/
lock ioc->lock
while (!hlist_empty)
icq = hlist_entry
lock queue_lock
ioc_destroy_icq
delete icq->ioc_node
while (!list_empty)
icq = list_entry() list_del icq->q_node
/*
* This is not protected by any lock,
* list_entry concurrent with list_del
* is not safe.
*/
unlock queue_lock
unlock ioc->lock
Fix this problem by protecting list 'icq->q_node' by queue_lock from
ioc_clear_queue().
Reported-and-tested-by: Pradeep Pragallapati <quic_pragalla@quicinc.com>
Link: https://lore.kernel.org/lkml/20230517084434.18932-1-quic_pragalla@quicinc.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531073435.2923422-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The scsi driver function sd_read_block_characteristics() always calls
disk_set_zoned() to a disk zoned model correctly, in case the device
model changed. This is done even for regular disks to set the zoned
model to BLK_ZONED_NONE and free any zone related resources if the drive
previously was zoned.
This behavior significantly impact the time it takes to revalidate disks
on a large system as the call to disk_clear_zone_settings() done from
disk_set_zoned() for the BLK_ZONED_NONE case results in the device
request queued to be frozen, even if there are no zone resources to
free.
Avoid this overhead for non-zoned devices by not calling
disk_clear_zone_settings() in disk_set_zoned() if the device model
was already set to BLK_ZONED_NONE, which is always the case for regular
devices.
Reported by: Brian Bunker <brian@purestorage.com>
Fixes: 508aebb80527 ("block: introduce blk_queue_clear_zone_settings()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230529073237.1339862-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.
The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-7-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.
The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-6-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-5-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning is only set when a page reference has been acquired that needs to
be released by bio_release_pages().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge splice bits as subsequent block cleanups and improvements for DIO
depend on them.
* for-6.5/splice: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
Since the dawn of time bio_check_eod has a check for a non-zero size of
the device. This doesn't really make any sense as we never want to send
I/O to a device that's been set to zero size, or never moved out of that.
I am a bit surprised we haven't caught this for a long time, but the
removal of the extra validation inside of zram caused syzbot to trip
over this issue recently. I've added a Fixes tag for that commit, but
the issue really goes back way before git history.
Fixes: 9fe95babc742 ("zram: remove valid_io_request")
Reported-by: syzbot+b8d61a58b7c7ebd2c8e0@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230524060538.1593686-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit 50e34d78815e ("block: disable the elevator int del_gendisk")
move rq_qos_exit() from disk_release() to del_gendisk(), this will
introduce some problems:
1) If rq_qos_add() is triggered by enabling iocost/iolatency through
cgroupfs, then it can concurrent with del_gendisk(), it's not safe to
write 'q->rq_qos' concurrently.
2) Activate cgroup policy that is relied on rq_qos will call
rq_qos_add() and blkcg_activate_policy(), and if rq_qos_exit() is
called in the middle, null-ptr-dereference will be triggered in
blkcg_activate_policy().
3) blkg_conf_open_bdev() can call blkdev_get_no_open() first to find the
disk, then if rq_qos_exit() from del_gendisk() is done before
rq_qos_add(), then memory will be leaked.
This patch add a new disk level mutex 'rq_qos_mutex':
1) The lock will protect rq_qos_exit() directly.
2) For wbt that doesn't relied on blk-cgroup, rq_qos_add() can only be
called from disk initialization for now because wbt can't be
destructed until rq_qos_exit(), so it's safe not to protect wbt for
now. Hoever, in case that rq_qos dynamically destruction is supported
in the furture, this patch also protect rq_qos_add() from wbt_init()
directly, this is enough because blk-sysfs already synchronize
writers with disk removal.
3) For iocost and iolatency, in order to synchronize disk removal and
cgroup configuration, the lock is held after blkdev_get_no_open()
from blkg_conf_open_bdev(), and is released in blkg_conf_exit().
In order to fix the above memory leak, disk_live() is checked after
holding the new lock.
Fixes: 50e34d78815e ("block: disable the elevator int del_gendisk")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230414084008.2085155-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit <8af870aa5b847> ("block: enable bio caching use for passthru IO")
introduced bio-cache for passthru IO. In case when nr_vecs are greater
than BIO_INLINE_VECS, bio and bvecs are allocated from mempool (instead
of percpu cache) and REQ_ALLOC_CACHE is cleared. This causes the side
effect of not freeing bio/bvecs into mempool on completion.
This patch lets the passthru IO fallback to allocation using bio_kmalloc
when nr_vecs are greater than BIO_INLINE_VECS. The corresponding bio
is freed during call to blk_mq_map_bio_put during completion.
Cc: stable@vger.kernel.org # 6.1
fixes <8af870aa5b847> ("block: enable bio caching use for passthru IO")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230523111709.145676-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If multiple CPUs are sharing the same hardware queue, it can
cause leak in the active queue counter tracking when __blk_mq_tag_busy()
is executed simultaneously.
Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
Signed-off-by: Tian Lan <tian.lan@twosigma.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20230522210555.794134-1-tilan7663@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit b11d31ae01e6 ("blk-wbt: remove unnecessary check in
wbt_enable_default()") removes the checking of CONFIG_BLK_WBT_MQ by
mistake, which is used to control enable or disable wbt by default.
Fix the problem by adding back the checking. This patch also do a litter
cleanup to make related code more readable.
Fixes: b11d31ae01e6 ("blk-wbt: remove unnecessary check in wbt_enable_default()")
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/lkml/CAKXUXMzfKq_J9nKHGyr5P5rvUETY4B-fxoQD4sO+NYjFOfVtZA@mail.gmail.com/t/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230522121854.2928880-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Niklas Cassel <nks@flawful.org> says:
This series adds support for Command Duration Limits.
The series is based on linux tag: v6.4-rc1
The series can also be found in git: https://github.com/floatious/linux/commits/cdl-v7
=================
CDL in ATA / SCSI
=================
Command Duration Limits is defined in:
T13 ATA Command Set - 5 (ACS-5) and
T10 SCSI Primary Commands - 6 (SPC-6) respectively
(a simpler version of CDL is defined in T10 SPC-5).
CDL defines Duration Limits Descriptors (DLD).
7 DLDs for read commands and 7 DLDs for write commands.
Simply put, a DLD contains a limit and a policy.
A command can specify that a certain limit should be applied by setting
the DLD index field (3 bits, so 0-7) in the command itself.
The DLD index points to one of the 7 DLDs.
DLD index 0 means no descriptor, so no limit.
DLD index 1-7 means DLD 1-7.
A DLD can have a few different policies, but the two major ones are:
-Policy 0xF (abort), command will be completed with command aborted error
(ATA) or status CHECK CONDITION (SCSI), with sense data indicating that
the command timed out.
-Policy 0xD (complete-unavailable), command will be completed without
error (ATA) or status GOOD (SCSI), with sense data indicating that the
command timed out. Note that the command will not have transferred any
data to/from the device when the command timed out, even though the
command returned success.
Regardless of the CDL policy, in case of a CDL timeout, the I/O will
result in a -ETIME error to user-space.
The DLDs are defined in the CDL log page(s) and are readable and writable.
Reading and writing the CDL DLDs are outside the scope of the kernel.
If a user wants to read or write the descriptors, they can do so using a
user-space application that sends passthrough commands, such as cdl-tools:
https://github.com/westerndigitalcorporation/cdl-tools
================================
The introduction of ioprio hints
================================
What the kernel does provide, is a method to let I/O use one of the CDL DLDs
defined in the device. Note that the kernel will simply forward the DLD index
to the device, so the kernel currently does not know, nor does it need to know,
how the DLDs are defined inside the device.
The way that the CDL DLD index is supplied to the kernel is by introducing a
new 10 bit "ioprio hint" field within the existing 16 bit ioprio definition.
Currently, only 6 out of the 16 ioprio bits are in use, the remaining 10 bits
are unused, and are currently explicitly disallowed to be set by the kernel.
For now, we only add ioprio hints representing CDL DLD index 1-7. Additional
ioprio hints for other QoS features could be defined in the future.
A theoretical future work could be to make an I/O scheduler aware of these
hints. E.g. for CDL, an I/O scheduler could make use of the duration limit
in each descriptor, and take that information into account while scheduling
commands. Right now, the ioprio hints will be ignored by the I/O schedulers.
==============================
How to use CDL from user-space
==============================
Since CDL is mutually exclusive with NCQ priority
(see ncq_prio_enable and sas_ncq_prio_enable in
Documentation/ABI/testing/sysfs-block-device),
CDL has to be explicitly enabled using:
echo 1 > /sys/block/$bdev/device/cdl_enable
Since the ioprio hints are supplied through the existing I/O priority API,
it should be simple for an application to make use of the ioprio hints.
It simply has to reuse one of the new macros defined in
include/uapi/linux/ioprio.h: IOPRIO_PRIO_HINT() or IOPRIO_PRIO_VALUE_HINT(),
and supply one of the new hints defined in include/uapi/linux/ioprio.h:
IOPRIO_HINT_DEV_DURATION_LIMIT_[1-7], which indicates that the I/O should
use the corresponding CDL DLD index 1-7.
By reusing the I/O priority API, the user can both define a DLD to use per
AIO (io_uring sqe->ioprio or libaio iocb->aio_reqprio) or per-thread
(ioprio_set()).
=======
Testing
=======
With the following fio patches:
https://github.com/floatious/fio/commits/cdl
fio adds support for ioprio hints, such that CDL can be tested using e.g.:
fio --ioengine=io_uring --cmdprio_percentage=10 --cmdprio_hint=DLD_index
A simple way to test is to use a DLD with a very short duration limit,
and send large reads. Regardless of the CDL policy, in case of a CDL
timeout, the I/O will result in a -ETIME error to user-space.
We also provide a CDL test suite located in the cdl-tools repo, see:
https://github.com/westerndigitalcorporation/cdl-tools#testing-a-system-command-duration-limits-support
We have tested this patch series using:
-real hardware
-the following QEMU implementation:
https://github.com/floatious/qemu/tree/cdl
(NOTE: the QEMU implementation requires you to define the CDL policy at compile
time, so you currently need to recompile QEMU when switching between policies.)
===================
Further information
===================
For further information about CDL, see Damien's slides:
Presented at SDC 2021:
https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf
Presented at Lund Linux Con 2022:
https://drive.google.com/file/d/1I6ChFc0h4JY9qZdO1bY5oCAdYCSZVqWw/view?usp=sharing
================
Changes since V6
================
-Rebased series on v6.4-rc1.
-Picked up Reviewed-by tags from Hannes (Thank you Hannes!)
-Picked up Reviewed-by tag from Christoph (Thank you Christoph!)
-Changed KernelVersion from 6.4 to 6.5 for new sysfs attributes.
For older change logs, see previous patch series versions:
https://lore.kernel.org/linux-scsi/20230406113252.41211-1-nks@flawful.org/https://lore.kernel.org/linux-scsi/20230404182428.715140-1-nks@flawful.org/https://lore.kernel.org/linux-scsi/20230309215516.3800571-1-niklas.cassel@wdc.com/https://lore.kernel.org/linux-scsi/20230124190308.127318-1-niklas.cassel@wdc.com/https://lore.kernel.org/linux-scsi/20230112140412.667308-1-niklas.cassel@wdc.com/https://lore.kernel.org/linux-scsi/20221208105947.2399894-1-niklas.cassel@wdc.com/
Link: https://lore.kernel.org/r/20230511011356.227789-1-nks@flawful.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Introduce the new block I/O status BLK_STS_DURATION_LIMIT for LLDDs to
report command that failed due to a command duration limit being
exceeded. This new status is mapped to the ETIME error code to allow users
to differentiate "soft" duration limit failures from other more serious
hardware related errors.
If we compare BLK_STS_DURATION_LIMIT with BLK_STS_TIMEOUT:
-BLK_STS_DURATION_LIMIT means that the drive gave a reply indicating that
the command duration limit was exceeded before the command could be
completed. This I/O status is mapped to ETIME for user space.
-BLK_STS_TIMEOUT means that the drive never gave a reply at all.
This I/O status is mapped to ETIMEDOUT for user space.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Link: https://lore.kernel.org/r/20230511011356.227789-4-nks@flawful.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The I/O priority user interface defines the 16-bits ioprio values as the
combination of the upper 3-bits for an I/O priority class and the lower
13-bits as priority data. However, the kernel only uses the lower 3-bits of
the priority data to define priority levels for the RT and BE priority
classes. The data part of an ioprio value is completely ignored for the
IDLE and NONE classes. This is enforced by checks done in
ioprio_check_cap(), which is called for all paths that allow defining an
I/O priority for I/Os: the per-context ioprio_set() system call, aio
interface and io_uring interface.
Clarify this fact in the uapi ioprio.h header file and introduce the
IOPRIO_PRIO_LEVEL_MASK and IOPRIO_PRIO_LEVEL() macros for users to define
and get priority levels in an ioprio value. The coarser macro
IOPRIO_PRIO_DATA() is retained for backward compatibility with old
applications already using it. There is no functional change introduced
with this.
In-kernel users of the IOPRIO_PRIO_DATA() macro which are explicitly
handling I/O priority data as a priority level are modified to use the new
IOPRIO_PRIO_LEVEL() macro without any functional change. Since f2fs is the
only user of this macro not explicitly using that value as a priority
level, it is left unchanged.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Link: https://lore.kernel.org/r/20230511011356.227789-2-nks@flawful.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Mike Christie <michael.christie@oracle.com> says:
The patches in this thread allow us to use the block pr_ops with LIO's
target_core_iblock module to support cluster applications in VMs. They
were built over Linus's tree. They also apply over linux-next and
Martin's tree and Jens's trees.
Currently, to use windows clustering or linux clustering (pacemaker +
cluster labs scsi fence agents) in VMs with LIO and vhost-scsi, you
have to use tcmu or pscsi or use a cluster aware FS/framework for the
LIO pr file. Setting up a cluster FS/framework is pain and waste when
your real backend device is already a distributed device, and pscsi
and tcmu are nice for specific use cases, but iblock gives you the
best performance and allows you to use stacked devices like
dm-multipath. So these patches allow iblock to work like pscsi/tcmu
where they can pass a PR command to the backend module. And then
iblock will use the pr_ops to pass the PR command to the real devices
similar to what we do for unmap today.
The patches are separated in the following groups:
Patch 1 - 2:
- Add block layer callouts for reading reservations and rename reservation
error code.
Patch 3 - 5:
- SCSI support for new callouts.
Patch 6:
- DM support for new callouts.
Patch 7 - 13:
- NVMe support for new callouts.
Patch 14 - 18:
- LIO support for new callouts.
This patchset has been tested with the libiscsi PGR ops and with
window's failover cluster verification test. Note that for scsi
backend devices we need this patchset:
https://lore.kernel.org/linux-scsi/20230123221046.125483-1-michael.christie@oracle.com/T/#m4834a643ffb5bac2529d65d40906d3cfbdd9b1b7
to handle UAs. To reduce the size of this patchset that's being done
separately to make reviewing easier. And to make merging easier this
patchset and the one above do not have any conflicts so can be merged
in different trees.
Link: https://lore.kernel.org/r/20230407200551.12660-1-michael.christie@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
For direct I/O writes that issues more than a single bio, the plugging
is already done in __blkdev_direct_IO.
For synchronous buffered writes the plugging is done deep down in
writeback_inodes_wb / wb_writeback.
For the other cases there is no point in plugging as as single bio or no
bio at all is submitted.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230520044503.334444-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
User should not be able to write block device if it is read-only at
block level (e.g force_ro attribute). This is ensured in the regular
fops write operation (blkdev_write_iter) but not when writing via
user mapping (mmap), allowing user to actually write a read-only
block device via a PROT_WRITE mapping.
Example: This can lead to integrity issue of eMMC boot partition
(e.g mmcblk0boot0) which is read-only by default.
To fix this issue, simply deny shared writable mapping if the block
is readonly.
Note: Block remains writable if switch to read-only is performed
after the initial mapping, but this is expected behavior according
to commit a32e236eb93e ("Partially revert "block: fail op_is_write()
requests to read-only partitions"")'.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230510074223.991297-1-loic.poulain@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check bfqq->dispatched for each BFQ queue instead of checking it for an
invalid bfqq pointer.
Fixes: 3e49c1e4a615 ("block: BFQ: Add several invariant checks")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230519220347.3643295-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently both requeues of commands that were already sent to the driver
and flush commands submitted from the flush state machine share the same
requeue_list struct request_queue, despite requeues doing head
insertions and flushes not. Switch to using two separate lists instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230519044050.107790-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_flush_complete_seq currently queues requests that write data after
a pre-flush from the flush state machine at the head of the queue.
This doesn't really make sense, as the original request bypassed all
queue lists by directly diverting to blk_insert_flush from
blk_mq_submit_bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230519044050.107790-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Requests with the FUA bit on hardware without FUA support need a post
flush before returning to the caller, but they can still be sent using
the normal I/O path after initializing the flush-related fields and
end I/O handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Send write requests issued by the flush state machine through the normal
I/O submission path including the I/O scheduler (if present) so that I/O
scheduler policies are applied to writes with the FUA flag set.
Separate the I/O scheduler members from the flush members in struct
request since now a request may pass through both an I/O scheduler
and the flush machinery.
Note that the actual flush requests, which have no bio attached to the
request still bypass the I/O schedulers.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[hch: rebased]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230519044050.107790-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If blk_insert_flush decides that a command does not need to use the
flush state machine, return false and let blk_mq_submit_bio handle
it the normal way (including using an I/O scheduler) instead of doing
a bypass insert.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230519044050.107790-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a switch statement to decide on the disposition of a flush request
instead of multiple if statements, out of which one does checks that are
more complex than required. Also warn on a malformed request early
on instead of doing a BUG_ON later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230519044050.107790-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out a helper from blk_insert_flush that initializes the flush
machine related fields in struct request, and don't bother with the
full memset as there's just a few fields to initialize, and all but
one already have explicit initializers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230519044050.107790-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If anything goes wrong with the counters that track the number of
requests, I/O locks up. Make such scenarios easier to debug by adding
invariant checks for the request counters. Additionally, check that
BFQ queues are empty before these are freed.
Cc: Jan Kara <jack@suse.cz>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230516223853.1385255-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before dispatching a zoned write from the FIFO list, check whether there
are any zoned writes in the RB-tree with a lower LBA for the same zone.
This patch ensures that zoned writes happen in order even if at_head is
set for some writes for a zone and not for others.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230517174230.897144-12-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Start dispatching from the start of a zone instead of from the starting
position of the most recently dispatched request.
If a zoned write is requeued with an LBA that is lower than already
inserted zoned writes, make sure that it is submitted first.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230517174230.897144-11-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Track the position (sector_t) of the most recently dispatched request
instead of tracking a pointer to the next request to dispatch. This
patch is the basis for patch "Handle requeued requests correctly".
Without this patch it would be significantly more complicated to make
sure that zoned writes are dispatched in LBA order per zone.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230517174230.897144-10-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_free_requests() calls dd_finish_request() indirectly. Prevent
nested locking of dd->lock and dd->zone_lock by moving the code for
freeing requests.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230517174230.897144-9-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>