6679 Commits

Kemeng Shi
47df9ce95c blk-mq: Fix potential io hung for shared sbitmap per tagset
Commit f906a6a0f4268 ("blk-mq: improve tag waiting setup for non-shared
tags") marks restart only for unshared tags as an improvement. At that
time, tags were only shared between queues, and we could check whether
tags were shared by testing BLK_MQ_F_TAG_SHARED.
Afterwards, commit 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per
tagset") enabled tag sharing between hctxs inside a queue. We still only
mark restart for hctxs sharing tags inside a queue, which can cause an IO
hang if no tag is currently allocated by the hctxs about to be marked for
restart. Wait on the sbitmap_queue instead of marking restart for the
shared-hctxs case to fix this.
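
The shape of the fixed check, as a sketch (the flag and helper names
follow current blk-mq naming, but treat this as illustrative rather than
the exact diff):

  /*
   * Only a hctx whose tags are truly unshared may rely on the per-hctx
   * restart mark; any form of tag sharing has to sleep on the
   * sbitmap_queue wait queue so that a tag release wakes us up.
   */
  if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
      !blk_mq_is_shared_tags(hctx->flags)) {
          blk_mq_sched_mark_restart_hctx(hctx);
          return true;
  }
  /* shared case: add to the sbitmap_queue wait queue instead */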

Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-06 09:22:28 -07:00
Kemeng Shi
98b99e9412 blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait
In the shared-queues case, we will only wait on bitmap_tags if we fail to
get a driver tag. However, the rq could have been allocated from
breserved_tags, in which case two problems occur:
1. an IO hang if no tag is currently allocated from bitmap_tags.
2. an unnecessary wakeup when a tag is freed to bitmap_tags while no tag is
freed to breserved_tags.
Wait on the bitmap the rq was allocated from to fix this.
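
A minimal user-space model of the selection logic (the names stand in
for the kernel's tag structures and are not the real API):

  #include <stdbool.h>

  struct wait_queue;                          /* opaque stand-in */

  struct tag_pools {
          struct wait_queue *bitmap_tags;     /* normal tags */
          struct wait_queue *breserved_tags;  /* reserved tags */
  };

  /* wait on the pool the rq's tag was actually allocated from */
  static struct wait_queue *sbq_to_wait_on(struct tag_pools *tags,
                                           bool rq_is_reserved)
  {
          return rq_is_reserved ? tags->breserved_tags
                                : tags->bitmap_tags;
  }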

Fixes: f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-06 09:22:28 -07:00
Kemeng Shi
c31e76bcc3 blk-mq: remove stale comment for blk_mq_sched_mark_restart_hctx
Commit 97889f9ac24f8 ("blk-mq: remove synchronize_rcu() from
blk_mq_del_queue_tag_set()") removed the handling of TAG_SHARED in restart,
and shared_hctx_restart, which counted how many hardware queues were marked
for restart, was removed along with it.
Remove the stale comment claiming that we still count hardware queues that
need a restart.

Fixes: 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-06 09:22:28 -07:00
Kemeng Shi
6ee858a3d3 blk-mq: avoid sleep in blk_mq_alloc_request_hctx
Commit 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx") added
blk_mq_alloc_request_hctx to send commands to a specific queue. If
BLK_MQ_REQ_NOWAIT is not set during tag allocation, we may migrate to a
different hctx after sleeping and get a tag from an unexpected hctx, so
BLK_MQ_REQ_NOWAIT must be set in the flags for blk_mq_alloc_request_hctx.
After commit 600c3b0cea784 ("blk-mq: open code __blk_mq_alloc_request in
blk_mq_alloc_request_hctx"), blk_mq_alloc_request_hctx returns -EINVAL only
if both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED are unset, instead of
whenever BLK_MQ_REQ_NOWAIT is unset. So if BLK_MQ_REQ_NOWAIT is not set but
BLK_MQ_REQ_RESERVED is, blk_mq_alloc_request_hctx could allocate a tag from
an unexpected hctx. What we need here is to return -EINVAL if either
BLK_MQ_REQ_NOWAIT or BLK_MQ_REQ_RESERVED is not set.

Currently both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED are set when a
specific hctx is needed, in nvme_auth_submit, nvmf_connect_io_queue
and nvmf_connect_admin_queue, so this fixes a potential future case where
BLK_MQ_REQ_NOWAIT is missed.
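
A sketch of the intended flag check; the flag values mirror the kernel's
definitions but are reproduced here only for illustration:

  #include <errno.h>

  #define BLK_MQ_REQ_NOWAIT   (1u << 0)   /* don't sleep waiting for a tag */
  #define BLK_MQ_REQ_RESERVED (1u << 1)   /* allocate from the reserved pool */

  static int check_alloc_hctx_flags(unsigned int flags)
  {
          const unsigned int required =
                  BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED;

          /* reject if either flag is missing, not only if both are */
          if ((flags & required) != required)
                  return -EINVAL;
          return 0;
  }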

Fixes: 600c3b0cea78 ("blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-06 09:22:28 -07:00
Christoph Hellwig
e81cd5a983 block: stub out and deprecated the capability attribute on the gendisk
The capability attribute was added in 2017 to expose the kernel-internal
GENHD_FL_MEDIA_CHANGE_NOTIFY to userspace without ever adding a value to
a UAPI header, and without ever setting it in any driver until it was
finally removed in Linux 5.7.

Deprecate the file and always return 0 instead of exposing the other
internal and frequently renumbered gendisk flags.
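
The stub then amounts to something like the following sketch (the
warning text and the sysfs_emit choice are illustrative, not necessarily
the exact patch):

  static ssize_t disk_capability_show(struct device *dev,
                                      struct device_attribute *attr,
                                      char *buf)
  {
          /* warn once so scripts still relying on the file get a hint */
          pr_warn_once("the capability attribute has been deprecated.\n");
          return sysfs_emit(buf, "0\n");
  }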

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230203150209.3199115-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-06 08:44:55 -07:00
Christoph Hellwig
28e538a309 blk-cgroup: fix freeing NULL blkg in blkg_create
new_blkg can be NULL if the caller didn't pass in a pre-allocated blkg.
Don't try to free it in that case.

Fixes: 27b642b07a4a ("blk-cgroup: simplify blkg freeing from initialization failure paths")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230206150201.3438972-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-06 08:43:23 -07:00
Christoph Hellwig
d58cdfae6a block: factor out a bvec_set_page helper
Add a helper to initialize a bvec based on a page pointer.  This will help
remove various open-coded bvec initializations.
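
The helper boils down to a plain three-field initialization; a sketch
with the relevant bio_vec fields spelled out for context:

  struct page;                            /* opaque here */

  struct bio_vec {
          struct page  *bv_page;
          unsigned int  bv_len;
          unsigned int  bv_offset;
  };

  static inline void bvec_set_page(struct bio_vec *bv, struct page *page,
                                   unsigned int len, unsigned int offset)
  {
          bv->bv_page = page;
          bv->bv_len = len;
          bv->bv_offset = offset;
  }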

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:54 -07:00
Christoph Hellwig
3f13ab7c80 blk-cgroup: move the cgroup information to struct gendisk
cgroup information only makes sense on a live gendisk that allows
file system I/O (which includes the raw block device).  So move over
the cgroup related members.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
479664cee1 blk-cgroup: pass a gendisk to blkg_lookup
Pass a gendisk to blkg_lookup and use that to find the match as part
of phasing out usage of the request_queue in the blk-cgroup code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
0a0b4f79db blk-cgroup: pass a gendisk to pd_alloc_fn
No need for the request_queue here; pass a gendisk and extract the
node ids from that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
40e4996ec0 blk-cgroup: pass a gendisk to blkcg_{de,}activate_policy
Prepare for storing the blkcg information in the gendisk instead of
the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
ba91c849fa blk-rq-qos: store a gendisk instead of request_queue in struct rq_qos
This is what about half of the users already want, and that share is only
going to grow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
3963d84df7 blk-rq-qos: constify rq_qos_ops
These op vectors are constant, so mark them const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
ce57b55860 blk-rq-qos: make rq_qos_add and rq_qos_del more useful
Switch to passing a gendisk, make rq_qos_add initialize all required
fields, and drop the no longer required q argument from rq_qos_del.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
b494f9c566 blk-rq-qos: move rq_qos_add and rq_qos_del out of line
These two functions are rather large and not in a fast path, so move
them out of line.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
4e1d91ae87 blk-wbt: open code wbt_queue_depth_changed in wbt_init
wbt_queue_depth_changed just updates a field and calls another function.
Open code it in wbt_init, so that the local queue variable can be used
instead of the one stored in the rq_qos.  This will allow delaying that
rq_qos->queue assignment in a subsequent patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
0bc65bd41d blk-wbt: move private information from blk-wbt.h to blk-wbt.c
A large part of blk-wbt.h is only used in blk-wbt.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
958f296547 blk-wbt: pass a gendisk to wbt_init
Pass a gendisk to wbt_init to prepare for phasing out usage of the
request_queue in the blk-cgroup code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
04aad37be1 blk-wbt: pass a gendisk to wbt_{enable,disable}_default
Pass a gendisk to wbt_enable_default and wbt_disable_default to
prepare for phasing out usage of the request_queue in the blk-cgroup
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
f05837ed73 blk-cgroup: store a gendisk to throttle in struct task_struct
Switch the throttle information in struct task_struct from holding a
request_queue pointer and reference to holding a gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
84d7d462b1 blk-cgroup: pin the gendisk in struct blkcg_gq
Currently each blkcg_gq holds a request_queue reference, which is what
is used in the policies.  But a lot of these interfaces will move over to
use a gendisk, so store a disk in struct blkcg_gq and hold a reference to
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
180b04d450 blk-cgroup: remove the !bdi->dev check in blkg_dev_name
bdi_dev_name already performs the same check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
27b642b07a blk-cgroup: simplify blkg freeing from initialization failure paths
There is no need to delay freeing a blkg to a workqueue when freeing it
after an initialization failure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:05 -07:00
Christoph Hellwig
0b6f93bdf0 blk-cgroup: improve error unwinding in blkg_alloc
Unwind only the previous initialization steps that happened in blkg_alloc
using goto-based unwinding.  This avoids the need for the !queue special
case in blkg_free and thus ensures that any blkg seen outside of
blkg_alloc is always fully constructed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:04 -07:00
Christoph Hellwig
178fa7d498 blk-cgroup: delay blk-cgroup initialization until add_disk
There is no need to initialize the cgroup code before the disk is marked
live.  Moving the cgroup initialization to add_disk helps to have a
fully initialized struct device in the gendisk for the cgroup code to
use in the future.  Similarly, tear the cgroup information down in
del_gendisk to be symmetric, and because none of the cgroup tracking is
needed once non-passthrough I/O stops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:04 -07:00
Christoph Hellwig
a886001c2d block: don't call blk_throtl_stat_add for non-READ/WRITE commands
blk_throtl_stat_add is called from blk_stat_add explicitly, unlike the
other stats that go through q->stats->callbacks.  To prepare for cgroup
data moving to the gendisk, ensure blk_throtl_stat_add is only called
for the plain READ and WRITE commands that it actually handles internally,
as blk_stat_add can also be called for passthrough commands on queues that
do not have a gendisk associated with them.
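
The guard described above, as a sketch of the call site:

  /* only plain READ/WRITE requests are throttled; passthrough commands
   * may run on queues without a gendisk, so skip them here */
  if (req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE)
          blk_throtl_stat_add(rq, value);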

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:04 -07:00
Ulf Hansson
4a6a7bc21d block: Default to use cgroup support for BFQ
Assuming that both Kconfig options, BLK_CGROUP and IOSCHED_BFQ, are set, we
most likely want cgroup support for BFQ too (BFQ_GROUP_IOSCHED), so let's
make it default y.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20230130121240.159456-1-ulf.hansson@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30 09:42:42 -07:00
Kemeng Shi
323745a3aa block, bfq: remove unused bfq_wr_max_time in struct bfq_data
bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed.
It is only used in bfq_wr_duration when bfq_wr_max_time > 0, a condition
that is never met, so bfqd->bfq_wr_max_time is effectively unused. Just
remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi
87c971de81 block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq
We jump to this tag only to return the current rq. Return directly and
remove the tag.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi
433d4b03e7 block, bfq: remove redundant check in bfq_put_cooperator
We have already avoided a circular list in bfq_setup_merge (see the
comments in bfq_setup_merge() for details), so a bfq_queue will not appear
in its own new_bfqq list. Just remove this check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi
86f8382e6d block, bfq: remove unnecessary dereference to get async_bfqq
async_bfqq is assigned from bfqq->bic->bfqq[0]; use it directly.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi
8ac2e43c35 block, bfq: use helper macro RQ_BFQQ to get bfqq of request
Use helper macro RQ_BFQQ to get bfqq of request.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi
1c970450a7 block, bfq: initialize bfqq->decrease_time_jif correctly
The inject limit is updated or reset when time_is_before_eq_jiffies(
decrease_time_jif + several msecs) holds or when the think-time state
changes. decrease_time_jif is initialized to 0 and is set to the current
jiffies when the inject limit is updated or reset. If jiffies is slightly
greater than LONG_MAX, time_is_after_eq_jiffies(0) will hold for a long
time, and so will time_is_after_eq_jiffies(decrease_time_jif + several
msecs). If the think-time state never changes, injection will not work as
expected for a long time.

To be more specific:
bfq_update_inject_limit may be triggered when jiffies passes
decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting
bfqd->wait_dispatch to true.
bfq_reset_inject_limit is called under two conditions:
1. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(1000) in
bfq_add_request.
2. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(100), or the
bfq think-time state changes from short to long.

Fix this by initializing bfqq->decrease_time_jif to the current jiffies,
so that service injection is triggered soon once the injection conditions
are met.
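
The wrap-around can be reproduced in user space with the same
signed-difference comparison the jiffies macros use (a self-contained
model, not kernel code):

  #include <stdio.h>
  #include <limits.h>

  /* wrap-safe "a is in the past" test, like time_is_before_eq_jiffies(a) */
  static int is_past(unsigned long a, unsigned long now)
  {
          return (long)(now - a) >= 0;
  }

  int main(void)
  {
          unsigned long now = (unsigned long)LONG_MAX + 100; /* past LONG_MAX */

          /* seeded with 0: the deadline looks like the far future */
          printf("seed 0:   %d\n", is_past(0 + 10, now));        /* prints 0 */
          /* seeded with "now": the deadline fires as expected */
          printf("seed now: %d\n", is_past(now + 10, now + 20)); /* prints 1 */
          return 0;
  }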

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi
bebeb9e582 block, bfq: remove unsed parameter reason in bfq_bfqq_is_slow
The parameter reason is never used; just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi
0c3e09e885 block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injection
Function bfq_choose_bfqq_for_injection may temporarily raise the inject
limit to one request if the current inject_limit is 0 before searching for
a source queue for injection. However, the search below resets the limit
from bfqd->in_service_queue, which is zero for a temporarily raised limit,
so the raised inject limit never works as expected.
Reassigning limit from bfqd->in_service_queue inside the search is needed
because limit may have been overwritten to min_t(unsigned int, 1, limit)
for the case where a large in-flight request sits on a non-rotational
device in the found queue, so limit has to be restored for the normal case.

Actually, we have already made sure that bfqd->rq_in_driver < limit before
the search, so:
 - limit is >= 1 since bfqd->rq_in_driver is >= 0, and min_t(unsigned int,
1, limit) is therefore always 1. So we can simply compare bfqd->rq_in_driver
against 1 instead of against the result of min_t(unsigned int, 1, limit) in
the large-request non-rotational device case, which avoids overwriting
limit, and the bug is gone.
 - For the normal case, we have already checked that bfqd->rq_in_driver <
limit, so we can return the found bfqq unconditionally and remove the
unnecessary check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Yu Kuai
f1c006f1c6 blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()
Currently parent pd can be freed before child pd:

t1: remove cgroup C1
blkcg_destroy_blkgs
 blkg_destroy
  list_del_init(&blkg->q_node)
  // remove blkg from queue list
  percpu_ref_kill(&blkg->refcnt)
   blkg_release
    call_rcu

t2: from t1
__blkg_release
 blkg_free
  schedule_work
			t4: deactivate policy
			blkcg_deactivate_policy
			 pd_free_fn
			 // parent of C1 is freed first
t3: from t2
 blkg_free_workfn
  pd_free_fn

If a policy (for example, ioc_timer_fn() from iocost) accesses the parent
pd through the child pd after pd_offline_fn(), a UAF can be triggered.

Fix the problem by delaying 'list_del_init(&blkg->q_node)' from
blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to
synchronize blkg_free_workfn() and blkcg_deactivate_policy().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:19:04 -07:00
Yu Kuai
dfd6200a09 blk-cgroup: support to track if policy is online
A new field 'online' is added to blkg_policy_data to fix the following
two problems:

1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT'
   fails, 'queue_lock' will be dropped and pd_alloc_fn() will try again
   without 'GFP_NOWAIT'. In the meantime, cgroup removal can race with
   it, and pd_offline_fn() will be called without pd_init_fn() and
   pd_online_fn(). This way a null-ptr-dereference can be triggered.

2) In order to synchronize pd_free_fn() from blkg_free_workfn() and
   blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be
   delayed to blkg_free_workfn(), hence pd_offline_fn() can be called
   first in blkg_destroy() and then again from blkcg_deactivate_policy();
   we must prevent the second call.

The new field 'online' will be set after pd_online_fn() and cleared after
pd_offline_fn(); in turn, pd_offline_fn() will only be called if 'online'
is set.
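
In essence the guard looks like this sketch (field placement and the
actual locking are elided):

  #include <stdbool.h>

  struct blkg_policy_data {
          bool online;  /* set after pd_online_fn(), cleared on offline */
          /* ... */
  };

  static void blkg_pd_offline(struct blkg_policy_data *pd)
  {
          /* only offline a pd that was actually onlined, and only once */
          if (!pd->online)
                  return;
          pd->online = false;
          /* ... invoke the policy's pd_offline_fn() here ... */
  }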

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:19:04 -07:00
Yu Kuai
c7241babf0 blk-cgroup: dropping parent refcount after pd_free_fn() is done
Some cgroup policies will access the parent pd through the child pd even
after pd_offline_fn() is done. If pd_free_fn() for the parent is called
before the child's, a UAF can be triggered. Hence it's better to guarantee
the order of pd_free_fn() calls.

Currently the refcount of the parent blkg is dropped in __blkg_release(),
which is before pd_free_fn() is called in blkg_free_work_fn(), and
blkg_free_work_fn() is called asynchronously.

This patch makes sure the pd_free_fn() calls triggered by cgroup removal
are ordered by delaying the drop of the parent refcount until after
pd_free_fn() has been called for the child.

BTW, pd_free_fn() will also be called from blkcg_deactivate_policy() when
deleting a device; the following patches will guarantee the order for that
path.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:19:04 -07:00
Zhong Jinghua
b36781034c blk-mq: cleanup unused methods: blk_mq_hw_sysfs_store
We found that the blk_mq_hw_sysfs_store interface is no longer used
anywhere. The default_hw_ctx_attrs object using blk_mq_hw_sysfs_ops only
uses the show method and does not use the store method.

Since commit 4a46f05ebf99 ("blk-mq: move hctx and ctx counters from sysfs
to debugfs") moved the store method to debugfs, the store method has been
unused.

So do some tiny cleanup and remove the unused code.

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:35 -07:00
Jens Axboe
33391eecd6 block: treat poll queue enter similarly to timeouts
We ran into an issue where a production workload would randomly grind to
a halt and not continue until the pending IO had timed out. This turned
out to be a complicated interaction between queue freezing and polled
IO:

1) You have an application that does polled IO. At any point in time,
   there may be polled IO pending.

2) You have a monitoring application that issues a passthrough command,
   which is marked with side effects such that it needs to freeze the
   queue.

3) Passthrough command is started, which calls blk_freeze_queue_start()
   on the device. At this point the queue is marked frozen, and any
   attempt to enter the queue will fail (for non-blocking) or block.

4) Now the driver calls blk_mq_freeze_queue_wait(), which will return
   when the queue is quiesced and pending IO has completed.

5) The pending IO is polled IO, but any attempt to poll IO through the
   normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to
   bio_queue_enter() as the queue is frozen. Rather than poll and
   complete IO, the polling threads will sit in a tight loop attempting
   to poll, but failing to enter the queue to do so.

The end result is that progress for either application will be stalled
until all pending polled IO has timed out. This causes obvious huge
latency issues for the application doing polled IO, but also long delays
for the passthrough command.

Fix this by treating queue enter for polled IO just like we do for
timeouts. This allows quick quiesce of the queue as we still poll and
complete this IO, while still disallowing queueing up new IO.
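
The distinction that makes this work is which percpu-ref helper the
enter path uses; a sketch of the poll-side enter after the fix (call
names are illustrative of the approach, not a verbatim diff):

  /* the normal enter path uses percpu_ref_tryget_live() and fails as
   * soon as a freeze has started; the timeout path instead uses
   * percpu_ref_tryget(), which succeeds until the ref truly drops to
   * zero. Polling now does the same, so pending polled IO can still
   * be reaped while new IO remains blocked. */
  if (!percpu_ref_tryget(&q->q_usage_counter))
          return 0;
  ret = blk_mq_poll(q, cookie, iob, flags);
  percpu_ref_put(&q->q_usage_counter);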

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Li Nan
b326032965 blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params()
vrate_min is calculated with DIV64_U64_ROUND_UP, but vrate_max is
calculated with div64_u64, so vrate_min may end up 1 greater than vrate_max
if the min and max input values of cost.qos are equal.
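
A self-contained demonstration of the crossover, with user-space
stand-ins for the two kernel helpers:

  #include <stdio.h>
  #include <stdint.h>

  static uint64_t div64_u64(uint64_t a, uint64_t b)          /* round down */
  {
          return a / b;
  }

  static uint64_t div64_u64_round_up(uint64_t a, uint64_t b) /* round up */
  {
          return (a + b - 1) / b;
  }

  int main(void)
  {
          /* equal qos min/max inputs feeding both conversions */
          uint64_t val = 10, unit = 4;

          printf("vrate_min=%llu vrate_max=%llu\n",
                 (unsigned long long)div64_u64_round_up(val, unit), /* 3 */
                 (unsigned long long)div64_u64(val, unit));         /* 2 */
          return 0;
  }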

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Li Nan
984af1e66b blk-iocost: fix divide by 0 error in calc_lcoefs()
Echoing the max of u64 to cost.model can cause a divide-by-0 error.

  # echo 8:0 rbps=18446744073709551615 > /sys/fs/cgroup/io.cost.model

  divide error: 0000 [#1] PREEMPT SMP
  RIP: 0010:calc_lcoefs+0x4c/0xc0
  Call Trace:
   <TASK>
   ioc_refresh_params+0x2b3/0x4f0
   ioc_cost_model_write+0x3cb/0x4c0
   ? _copy_from_iter+0x6d/0x6c0
   ? kernfs_fop_write_iter+0xfc/0x270
   cgroup_file_write+0xa0/0x200
   kernfs_fop_write_iter+0x17d/0x270
   vfs_write+0x414/0x620
   ksys_write+0x73/0x160
   __x64_sys_write+0x1e/0x30
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL;
the addition overflows if bps plus IOC_PAGE_SIZE is greater than
ULLONG_MAX, which can then cause a divide-by-0 error.
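
The overflow is easy to reproduce in user space (the IOC_PAGE_SIZE value
is illustrative; the macro matches the kernel's DIV_ROUND_UP_ULL
arithmetic):

  #include <stdio.h>
  #include <stdint.h>

  #define IOC_PAGE_SIZE 4096ULL
  #define DIV_ROUND_UP_ULL(n, d) (((n) + (d) - 1) / (d))

  int main(void)
  {
          uint64_t bps = UINT64_MAX;  /* rbps=18446744073709551615 */

          /* n + d - 1 wraps around, so the quotient collapses to 0 */
          uint64_t pages = DIV_ROUND_UP_ULL(bps, IOC_PAGE_SIZE);

          printf("pages=%llu\n", (unsigned long long)pages); /* pages=0 */
          /* a later cost computation divides by this -> divide error */
          return 0;
  }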

Fix the problem by setting basecost

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Yu Kuai
35198e3230 blk-iocost: read params inside lock in sysfs apis
Otherwise, the user might read inconsistent values if the params are
updated concurrently.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Yu Kuai
235a5a83f6 blk-iocost: don't allow to configure bio based device
iocost is based on rq_qos, which can only work for request-based devices,
so it doesn't make sense to configure iocost for bio-based devices.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Yu Kuai
7b7c5ae440 blk-iocost: check return value of match_u64()
This patch fixes ioc_qos_write() not checking the return value of
match_u64().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Arnd Bergmann
5f2779dfa7 blk-iocost: avoid 64-bit division in ioc_timer_fn
The behavior of 'enum' types has changed in gcc-13, so now the
UNBUSY_THR_PCT constant is interpreted as a 64-bit number because
it is defined as part of the same enum definition as some other
constants that do not fit within a 32-bit integer. This in turn
leads to some inefficient code on 32-bit architectures as well
as a link error:

arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn':
blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod'

Split the enum definition to keep the 64-bit timing constants in
a separate enum type from those constants that can clearly fit
within a smaller type.
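
The pattern of the fix, as a standalone C sketch (constant names and
values are placeholders, not the blk-iocost ones):

  /* before: one enum mixing small and 64-bit constants; with the gcc-13
   * semantics every enumerator gets the enum's (64-bit) type, so the
   * percentage constant silently becomes 64-bit too */
  enum {
          SOME_PCT   = 75,           /* would fit in an int */
          BIG_PERIOD = 1ULL << 37,   /* needs 64 bits */
  };

  /* after: split them so the small constants keep a 32-bit type */
  enum { SOME_PCT_32 = 75 };
  enum { BIG_PERIOD_64 = 1ULL << 37 };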

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230118080706.3303186-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Pankaj Raghav
e29b210021 block: add a new helper bdev_{is_zone_start, offset_from_zone_start}
Instead of open-coding the check for a zone start, add a helper to improve
readability and keep the logic in one place.
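
A sketch of the helper pair (zone sizes are a power of two, so the
offset is a simple mask; see the patch for the exact definitions):

  static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
                                                     sector_t sector)
  {
          return sector & (bdev_zone_sectors(bdev) - 1);
  }

  static inline bool bdev_is_zone_start(struct block_device *bdev,
                                        sector_t sector)
  {
          return bdev_offset_from_zone_start(bdev, sector) == 0;
  }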

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230110143635.77300-3-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Anuj Gupta
7e2e355dd9 block: extend bio-cache for non-polled requests
This patch modifies the existing check so that the bio cache is not
limited to iopoll.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Jens Axboe
67d59247d4 block: don't allow multiple bios for IOCB_NOWAIT issue
If we're doing a large IO request which needs to be split into multiple
bios for issue, then we can run into the same situation as the below
marked commit fixes - parts will complete just fine, one or more parts
will fail to allocate a request. This will result in a partially
completed read or write request, where the caller gets EAGAIN even though
parts of the IO completed just fine.

Do the same for large bios as we do for splits - fail a NOWAIT request
with EAGAIN. This isn't technically fixing an issue in the below marked
patch, but for stable purposes, we should have either none of them or
both.

This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")

Cc: stable@vger.kernel.org # 5.15+
Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
Link: https://github.com/axboe/liburing/issues/766
Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Jens Axboe
a3df2e456c block: add a BUILD_BUG_ON() for adding more bio flags than we have space
We have BIO_FLAG_LAST in the enum of bio-specific flags, but it's
not used to check that the flags don't exceed the space available
for them. Add such a check.
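
The technique, as a minimal self-contained C model (the kernel's
BUILD_BUG_ON is more elaborate, and the struct here is a stand-in):

  /* minimal BUILD_BUG_ON: a negative array size if cond is true */
  #define BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

  struct bio_like { unsigned short flags; };
  enum { FLAG_A, FLAG_B, FLAG_LAST };

  static void check_flags_fit(void)
  {
          /* stops compiling once FLAG_LAST exceeds the bits available */
          BUILD_BUG_ON(FLAG_LAST > 8 * sizeof(((struct bio_like *)0)->flags));
  }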

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:33 -07:00