2019-04-30 21:42:43 +03:00
// SPDX-License-Identifier: GPL-2.0
2005-04-17 02:20:36 +04:00
/*
 * Copyright (C) 1991, 1992 Linus Torvalds
 * Copyright (C) 1994, Karl Keyte: Added support for disk statistics
 * Elevator latency, (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
 * Queue request tables / lock, selectable elevator, Jens Axboe <axboe@suse.de>
2008-01-31 15:03:55 +03:00
 * kernel-doc documentation started by NeilBrown <neilb@cse.unsw.edu.au>
 *	- July2000
2005-04-17 02:20:36 +04:00
 * bio rewrite, highmem i/o, etc, Jens Axboe <axboe@suse.de> - may 2001
 */
/*
 * This handles all read/write requests to block devices
 */
# include <linux/kernel.h>
# include <linux/module.h>
# include <linux/bio.h>
# include <linux/blkdev.h>
2020-12-09 08:29:51 +03:00
# include <linux/blk-pm.h>
2021-09-20 15:33:27 +03:00
# include <linux/blk-integrity.h>
2005-04-17 02:20:36 +04:00
# include <linux/highmem.h>
# include <linux/mm.h>
mm: move readahead prototypes from mm.h
Patch series "Change readahead API", v11.
This series adds a readahead address_space operation to replace the
readpages operation. The key difference is that pages are added to the
page cache as they are allocated (and then looked up by the filesystem)
instead of passing them on a list to the readpages operation and having
the filesystem add them to the page cache. It's a net reduction in code
for each implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com
The only unconverted filesystems are those which use fscache. Their
conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier. This should be completed by the end of
the year.
I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
Miklos Szeredi have done a marvellous job of providing constructive
criticism.
These patches pass an xfstests run on ext4, xfs & btrfs with no
regressions that I can tell (some of the tests seem a little flaky
before and remain flaky afterwards).
This patch (of 25):
The readahead code is part of the page cache so should be found in the
pagemap.h file. force_page_cache_readahead is only used within mm, so
move it to mm/internal.h instead. Remove the parameter names where they
add no value, and rename the ones which were actively misleading.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 07:46:07 +03:00
# include <linux/pagemap.h>
2005-04-17 02:20:36 +04:00
# include <linux/kernel_stat.h>
# include <linux/string.h>
# include <linux/init.h>
# include <linux/completion.h>
# include <linux/slab.h>
# include <linux/swap.h>
# include <linux/writeback.h>
2006-12-10 13:19:35 +03:00
# include <linux/task_io_accounting_ops.h>
2006-12-08 13:39:46 +03:00
# include <linux/fault-inject.h>
2011-03-08 15:19:51 +03:00
# include <linux/list_sort.h>
2011-10-19 16:32:38 +04:00
# include <linux/delay.h>
2012-04-20 03:29:22 +04:00
# include <linux/ratelimit.h>
2013-03-23 07:42:26 +04:00
# include <linux/pm_runtime.h>
2019-09-16 18:44:29 +03:00
# include <linux/t10-pi.h>
2017-02-01 01:53:20 +03:00
# include <linux/debugfs.h>
2018-02-07 01:05:39 +03:00
# include <linux/bpf.h>
2021-11-23 21:53:12 +03:00
# include <linux/part_stat.h>
2020-05-14 11:45:09 +03:00
# include <linux/sched/sysctl.h>
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
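The contiguity requirement described above can be illustrated with a small user-space sketch. Everything here is hypothetical and is not the blk-crypto API: `struct toy_bio`, `toy_next_dun()` and `toy_dun_contiguous()` are invented names, and the sketch assumes a single 64-bit data unit number and a payload that is a whole multiple of the data unit size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a bio that carries an encryption context. */
struct toy_bio {
	uint64_t dun;		/* data unit number of the first data unit */
	uint32_t bytes;		/* payload length in bytes */
	uint32_t du_size;	/* data unit size in bytes */
};

/* Return the data unit number that must follow the last data unit of @b. */
uint64_t toy_next_dun(const struct toy_bio *b)
{
	/* Assumption: payloads are whole multiples of the data unit size. */
	assert(b->du_size != 0 && b->bytes % b->du_size == 0);
	return b->dun + b->bytes / b->du_size;
}

/* A back merge is only acceptable if @next starts exactly where @prev ends. */
bool toy_dun_contiguous(const struct toy_bio *prev, const struct toy_bio *next)
{
	return prev->du_size == next->du_size &&
	       toy_next_dun(prev) == next->dun;
}
```

With this check in place, only the first data unit number of a merged request needs to be remembered, which is the property the commit message relies on.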
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 03:37:18 +03:00
# include <linux/blk-crypto.h>
tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print,
whereas blktrace does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
run   dd                  dd + ioctl blktrace   dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s    7.50s, 42.0 MB/s      7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s    7.48s, 42.1 MB/s      7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s    7.45s, 42.2 MB/s      7.41s, 42.5 MB/s
So the overhead of tracing is very small, and there is no regression when
using those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 09:43:05 +04:00
# define CREATE_TRACE_POINTS
# include <trace/events/block.h>
2005-04-17 02:20:36 +04:00
2008-01-29 16:51:59 +03:00
# include "blk.h"
2021-11-23 21:53:08 +03:00
# include "blk-mq-sched.h"
2018-09-27 00:01:03 +03:00
# include "blk-pm.h"
2022-02-11 13:11:49 +03:00
# include "blk-cgroup.h"
2021-10-05 18:11:56 +03:00
# include "blk-throttle.h"
2008-01-29 16:51:59 +03:00
2017-02-01 01:53:20 +03:00
struct dentry *blk_debugfs_root;
2010-11-16 14:52:38 +03:00
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
2009-10-01 23:16:13 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
2013-04-18 20:00:26 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
2014-04-28 22:30:52 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL(block_split);
2012-12-14 23:49:27 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
2021-02-22 08:29:59 +03:00
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_insert);
2008-11-26 13:59:56 +03:00
2022-11-14 07:26:36 +03:00
static DEFINE_IDA(blk_queue_ida);
2011-12-14 03:33:37 +04:00
2005-04-17 02:20:36 +04:00
/*
 * For queue allocation
 */
2022-11-14 07:26:36 +03:00
static struct kmem_cache *blk_requestq_cachep;
2005-04-17 02:20:36 +04:00
/*
 * Controlling structure to kblockd
 */
2006-01-09 18:02:34 +03:00
static struct workqueue_struct *kblockd_workqueue;
2005-04-17 02:20:36 +04:00
2018-03-08 04:10:04 +03:00
/**
 * blk_queue_flag_set - atomically set a queue flag
 * @flag: flag to be set
 * @q: request queue
 */
void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
{
2018-11-14 19:02:07 +03:00
	set_bit(flag, &q->queue_flags);
2018-03-08 04:10:04 +03:00
}
EXPORT_SYMBOL(blk_queue_flag_set);
/**
 * blk_queue_flag_clear - atomically clear a queue flag
 * @flag: flag to be cleared
 * @q: request queue
 */
void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
{
2018-11-14 19:02:07 +03:00
	clear_bit(flag, &q->queue_flags);
2018-03-08 04:10:04 +03:00
}
EXPORT_SYMBOL(blk_queue_flag_clear);
/**
 * blk_queue_flag_test_and_set - atomically test and set a queue flag
 * @flag: flag to be set
 * @q: request queue
 *
 * Returns the previous value of @flag - 0 if the flag was not set and 1 if
 * the flag was already set.
 */
bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
{
2018-11-14 19:02:07 +03:00
	return test_and_set_bit(flag, &q->queue_flags);
2018-03-08 04:10:04 +03:00
}
EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
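The three helpers above are thin wrappers around atomic bit operations. A rough user-space analogue, using C11 atomics rather than the kernel's `set_bit()` family, might look like the sketch below. `struct toy_queue` and the `toy_flag_*` names are hypothetical, and the sketch assumes `flag` is smaller than the number of bits in an `unsigned long`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the queue_flags word of a request queue. */
struct toy_queue {
	atomic_ulong flags;
};

/* Atomically set bit @flag. */
void toy_flag_set(unsigned int flag, struct toy_queue *q)
{
	atomic_fetch_or(&q->flags, 1UL << flag);
}

/* Atomically clear bit @flag. */
void toy_flag_clear(unsigned int flag, struct toy_queue *q)
{
	atomic_fetch_and(&q->flags, ~(1UL << flag));
}

/* Atomically set bit @flag and report whether it was already set. */
bool toy_flag_test_and_set(unsigned int flag, struct toy_queue *q)
{
	/* atomic_fetch_or returns the value held before the update. */
	unsigned long old = atomic_fetch_or(&q->flags, 1UL << flag);

	return (old & (1UL << flag)) != 0;
}
```

The test-and-set variant is the useful one when exactly one caller should win a race to perform some one-time action.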
2019-06-20 20:59:16 +03:00
#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
static const char *const blk_op_name[] = {
	REQ_OP_NAME(READ),
	REQ_OP_NAME(WRITE),
	REQ_OP_NAME(FLUSH),
	REQ_OP_NAME(DISCARD),
	REQ_OP_NAME(SECURE_ERASE),
	REQ_OP_NAME(ZONE_RESET),
2019-08-01 20:26:36 +03:00
	REQ_OP_NAME(ZONE_RESET_ALL),
2019-10-27 17:05:45 +03:00
	REQ_OP_NAME(ZONE_OPEN),
	REQ_OP_NAME(ZONE_CLOSE),
	REQ_OP_NAME(ZONE_FINISH),
2020-05-12 11:55:47 +03:00
	REQ_OP_NAME(ZONE_APPEND),
2019-06-20 20:59:16 +03:00
	REQ_OP_NAME(WRITE_ZEROES),
	REQ_OP_NAME(DRV_IN),
	REQ_OP_NAME(DRV_OUT),
};
#undef REQ_OP_NAME
/**
 * blk_op_str - Return string XXX in the REQ_OP_XXX.
 * @op: REQ_OP_XXX.
 *
 * Description: Centralized block layer function to convert REQ_OP_XXX into
 * string format. Useful when debugging and tracing a bio or request. For
 * an invalid REQ_OP_XXX it returns the string "UNKNOWN".
 */
2022-07-14 21:06:28 +03:00
inline const char *blk_op_str(enum req_op op)
2019-06-20 20:59:16 +03:00
{
	const char *op_str = "UNKNOWN";
	if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
		op_str = blk_op_name[op];
	return op_str;
}
EXPORT_SYMBOL_GPL(blk_op_str);
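The name table above combines designated initializers with the preprocessor's token pasting (`##`) and stringification (`#`) operators. The same pattern can be tried outside the kernel with the sketch below. The `toy_op` enum, its values and `TOY_ARRAY_SIZE` are hypothetical. The gap at value 2 is deliberate, to show why the lookup also checks for a NULL entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcodes; value 2 is intentionally left unassigned. */
enum toy_op { TOY_OP_READ = 0, TOY_OP_WRITE = 1, TOY_OP_FLUSH = 3 };

/* [TOY_OP_READ] = "READ", and so on: index and string come from one token. */
#define TOY_OP_NAME(name) [TOY_OP_##name] = #name
static const char *const toy_op_name[] = {
	TOY_OP_NAME(READ),
	TOY_OP_NAME(WRITE),
	TOY_OP_NAME(FLUSH),
};
#undef TOY_OP_NAME

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Return the name for @op, or "UNKNOWN" for out-of-range or unnamed values. */
const char *toy_op_str(unsigned int op)
{
	/* Slots skipped by the designated initializers are NULL. */
	if (op < TOY_ARRAY_SIZE(toy_op_name) && toy_op_name[op])
		return toy_op_name[op];
	return "UNKNOWN";
}
```

Because the array is sized by its largest designated index, both the bounds check and the NULL check are needed to make the lookup safe for arbitrary input.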
2017-06-03 10:38:04 +03:00
static const struct {
	int errno;
	const char *name;
} blk_errors[] = {
	[BLK_STS_OK] = { 0, "" },
	[BLK_STS_NOTSUPP] = { -EOPNOTSUPP, "operation not supported" },
	[BLK_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" },
	[BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" },
	[BLK_STS_TRANSPORT] = { -ENOLINK, "recoverable transport" },
	[BLK_STS_TARGET] = { -EREMOTEIO, "critical target" },
	[BLK_STS_NEXUS] = { -EBADE, "critical nexus" },
	[BLK_STS_MEDIUM] = { -ENODATA, "critical medium" },
	[BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
	[BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" },
2018-01-31 06:04:57 +03:00
	[BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" },
2017-06-20 15:05:46 +03:00
	[BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" },
2022-02-03 22:28:26 +03:00
	[BLK_STS_OFFLINE] = { -ENODEV, "device offline" },
2017-06-03 10:38:04 +03:00
2017-06-03 10:38:06 +03:00
	/* device mapper special case, should not leak out: */
	[BLK_STS_DM_REQUEUE] = { -EREMCHG, "dm internal retry" },
2020-09-24 23:53:28 +03:00
	/* zone device specific errors */
	[BLK_STS_ZONE_OPEN_RESOURCE] = { -ETOOMANYREFS, "open zones exceeded" },
	[BLK_STS_ZONE_ACTIVE_RESOURCE] = { -EOVERFLOW, "active zones exceeded" },
2017-06-03 10:38:04 +03:00
	/* everything else not covered above: */
	[BLK_STS_IOERR] = { -EIO, "I/O" },
};
blk_status_t errno_to_blk_status(int errno)
{
	int i;
	for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
		if (blk_errors[i].errno == errno)
			return (__force blk_status_t)i;
	}
	return BLK_STS_IOERR;
}
EXPORT_SYMBOL_GPL(errno_to_blk_status);
int blk_status_to_errno(blk_status_t status)
{
	int idx = (__force int)status;
2017-06-21 20:55:46 +03:00
	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
2017-06-03 10:38:04 +03:00
		return -EIO;
	return blk_errors[idx].errno;
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);
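The two conversion helpers above share one table: status to errno is a bounds-checked index, while errno to status is a linear scan with a generic fallback. A user-space sketch of that asymmetry follows. The `toy_status` enum and its values are hypothetical and do not reproduce the real `blk_status_t` encoding. The struct member is called `err` here because `errno` is a macro in user-space C.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical status codes, used only for this illustration. */
enum toy_status { TOY_STS_OK = 0, TOY_STS_NOSPC, TOY_STS_TIMEOUT, TOY_STS_IOERR };

static const struct {
	int err;		/* negative errno value */
	const char *name;
} toy_errors[] = {
	[TOY_STS_OK]      = { 0,          "" },
	[TOY_STS_NOSPC]   = { -ENOSPC,    "critical space allocation" },
	[TOY_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" },
	[TOY_STS_IOERR]   = { -EIO,       "I/O" },
};

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* errno -> status: linear scan, falling back to the generic I/O error. */
enum toy_status toy_errno_to_status(int err)
{
	for (size_t i = 0; i < TOY_ARRAY_SIZE(toy_errors); i++) {
		if (toy_errors[i].err == err)
			return (enum toy_status)i;
	}
	return TOY_STS_IOERR;
}

/* status -> errno: direct index, with a bounds check for bogus values. */
int toy_status_to_errno(enum toy_status status)
{
	size_t idx = (size_t)status;

	if (idx >= TOY_ARRAY_SIZE(toy_errors))
		return -EIO;
	return toy_errors[idx].err;
}
```

The round trip is lossy by design: any errno without its own table entry collapses into the generic I/O status.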
2021-11-17 09:14:03 +03:00
const char *blk_status_to_str(blk_status_t status)
2017-06-03 10:38:04 +03:00
{
	int idx = (__force int)status;
2017-06-21 20:55:46 +03:00
	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
2021-11-17 09:14:03 +03:00
		return "<null>";
	return blk_errors[idx].name;
2017-06-03 10:38:04 +03:00
}
2005-04-17 02:20:36 +04:00
/**
 * blk_sync_queue - cancel any pending callbacks on a queue
 * @q: the queue
 *
 * Description:
 *     The block layer may perform asynchronous callback activity
 *     on a queue, such as calling the unplug function after a timeout.
 *     A block device may call blk_sync_queue to ensure that any
 *     such activity is cancelled, thus allowing it to release resources
2007-05-09 10:57:56 +04:00
 *     that the callbacks might use. The caller must already have made sure
2020-07-01 11:59:43 +03:00
 *     that its ->submit_bio will not re-add plugging prior to calling
2005-04-17 02:20:36 +04:00
 *     this function.
 *
2011-03-03 03:05:33 +03:00
 *     This function does not cancel any asynchronous activity arising
2014-09-08 20:27:23 +04:00
 *     out of elevator or throttling code. That would require elevator_exit()
2012-03-06 01:15:12 +04:00
 *     and blkcg_exit_queue() to be called with queue lock initialized.
 *
2011-03-03 03:05:33 +03:00
2005-04-17 02:20:36 +04:00
 */
void blk_sync_queue(struct request_queue *q)
{
2008-11-19 16:38:39 +03:00
	del_timer_sync(&q->timeout);
2017-10-19 20:00:48 +03:00
	cancel_work_sync(&q->timeout_work);
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL(blk_sync_queue);
2017-11-09 21:49:57 +03:00
/**
2018-09-27 00:01:04 +03:00
 * blk_set_pm_only - increment pm_only counter
2017-11-09 21:49:57 +03:00
 * @q: request queue pointer
 */
2018-09-27 00:01:04 +03:00
void blk_set_pm_only(struct request_queue *q)
2017-11-09 21:49:57 +03:00
{
2018-09-27 00:01:04 +03:00
	atomic_inc(&q->pm_only);
2017-11-09 21:49:57 +03:00
}
2018-09-27 00:01:04 +03:00
EXPORT_SYMBOL_GPL(blk_set_pm_only);
2017-11-09 21:49:57 +03:00
2018-09-27 00:01:04 +03:00
void blk_clear_pm_only(struct request_queue *q)
2017-11-09 21:49:57 +03:00
{
2018-09-27 00:01:04 +03:00
	int pm_only;
	pm_only = atomic_dec_return(&q->pm_only);
	WARN_ON_ONCE(pm_only < 0);
	if (pm_only == 0)
		wake_up_all(&q->mq_freeze_wq);
2017-11-09 21:49:57 +03:00
}
2018-09-27 00:01:04 +03:00
EXPORT_SYMBOL_GPL(blk_clear_pm_only);
2017-11-09 21:49:57 +03:00
2022-11-14 07:26:36 +03:00
static void blk_free_queue_rcu(struct rcu_head *rcu_head)
{
2022-12-15 05:16:29 +03:00
	struct request_queue *q = container_of(rcu_head,
			struct request_queue, rcu_head);
	percpu_ref_exit(&q->q_usage_counter);
	kmem_cache_free(blk_requestq_cachep, q);
2022-11-14 07:26:36 +03:00
}
static void blk_free_queue(struct request_queue *q)
{
	if (q->poll_stat)
		blk_stat_remove_callback(q, q->poll_cb);
	blk_stat_free_callback(q->poll_cb);
	blk_free_queue_stats(q->stats);
	kfree(q->poll_stat);
	if (queue_is_mq(q))
		blk_mq_release(q);
	ida_free(&blk_queue_ida, q->id);
	call_rcu(&q->rcu_head, blk_free_queue_rcu);
}
2020-06-19 23:47:23 +03:00
/**
 * blk_put_queue - decrement the request_queue refcount
 * @q: the request_queue structure to decrement the refcount for
 *
2022-11-14 07:26:36 +03:00
 * Decrements the refcount of the request_queue and frees it when the refcount
 * reaches 0.
2020-06-19 23:47:25 +03:00
 *
2022-11-14 07:26:37 +03:00
 * Context: Can sleep.
2020-06-19 23:47:23 +03:00
 */
2007-07-24 11:28:11 +04:00
void blk_put_queue(struct request_queue *q)
2006-03-19 02:34:37 +03:00
{
2022-11-14 07:26:37 +03:00
	might_sleep();
2022-11-14 07:26:36 +03:00
	if (refcount_dec_and_test(&q->refs))
		blk_free_queue(q);
2006-03-19 02:34:37 +03:00
}
2011-05-27 09:44:43 +04:00
EXPORT_SYMBOL(blk_put_queue);
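The put path above follows the usual "last reference frees the object" pattern. A minimal user-space sketch of that pattern with C11 atomics is shown below. `struct toy_queue`, the `toy_queue_*` functions and the `toy_freed_count` counter are all hypothetical. The sketch frees immediately instead of deferring through RCU as the kernel code does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a request queue. */
struct toy_queue {
	atomic_int refs;
};

/* Number of objects freed so far; present only to make the sketch observable. */
int toy_freed_count;

struct toy_queue *toy_queue_alloc(void)
{
	struct toy_queue *q = malloc(sizeof(*q));

	if (q)
		atomic_init(&q->refs, 1);	/* the caller owns the first reference */
	return q;
}

/* Take an additional reference. */
void toy_queue_get(struct toy_queue *q)
{
	atomic_fetch_add(&q->refs, 1);
}

/* Drop one reference; whoever drops the last one frees the object. */
void toy_queue_put(struct toy_queue *q)
{
	/* atomic_fetch_sub returns the value held before the decrement. */
	if (atomic_fetch_sub(&q->refs, 1) == 1) {
		free(q);
		toy_freed_count++;
	}
}
```

After the final put the pointer must not be touched again, which is why the kernel version documents its calling context so carefully.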
2006-03-19 02:34:37 +03:00
2021-09-29 10:12:40 +03:00
void blk_queue_start_drain(struct request_queue *q)
2014-12-23 00:04:42 +03:00
{
2017-03-27 15:06:58 +03:00
	/*
	 * When queue DYING flag is set, we need to block new req
	 * entering queue, so we call blk_freeze_queue_start() to
	 * prevent I/O from crossing blk_queue_enter().
	 */
	blk_freeze_queue_start(q);
2018-11-15 22:22:51 +03:00
	if (queue_is_mq(q))
2014-12-23 00:04:42 +03:00
		blk_mq_wake_waiters(q);
2017-11-09 21:49:53 +03:00
	/* Make blk_queue_enter() reexamine the DYING flag. */
	wake_up_all(&q->mq_freeze_wq);
2014-12-23 00:04:42 +03:00
}
2021-09-29 10:12:40 +03:00
2017-11-09 21:49:58 +03:00
/**
 * blk_queue_enter() - try to increase q->q_usage_counter
 * @q: request queue pointer
2020-12-09 08:29:50 +03:00
 * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
2017-11-09 21:49:58 +03:00
 */
2017-11-09 21:49:59 +03:00
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
2015-10-21 20:20:12 +03:00
{
2020-12-09 08:29:50 +03:00
	const bool pm = flags & BLK_MQ_REQ_PM;
2017-11-09 21:49:58 +03:00
2021-09-29 10:12:38 +03:00
	while (!blk_try_enter_queue(q, pm)) {
2017-11-09 21:49:58 +03:00
		if (flags & BLK_MQ_REQ_NOWAIT)
2022-09-12 19:53:25 +03:00
			return -EAGAIN;
2015-10-21 20:20:12 +03:00
2017-03-27 15:06:56 +03:00
		/*
2021-09-29 10:12:38 +03:00
		 * This is the read side of the barrier pair in
		 * blk_freeze_queue_start(). We need to order reading the
		 * __PERCPU_REF_DEAD flag of .q_usage_counter and reading
		 * .mq_freeze_depth or the queue dying flag, otherwise the
		 * following wait may never return if the two reads are reordered.
2017-03-27 15:06:56 +03:00
		 */
		smp_rmb();
2018-04-12 21:11:58 +03:00
		wait_event(q->mq_freeze_wq,
2019-05-21 06:25:55 +03:00
			   (!q->mq_freeze_depth &&
2020-12-09 08:29:51 +03:00
			    blk_pm_resume_queue(pm, q)) ||
2018-04-12 21:11:58 +03:00
			   blk_queue_dying(q));
2015-10-21 20:20:12 +03:00
		if (blk_queue_dying(q))
			return -ENODEV;
	}
2021-09-29 10:12:38 +03:00
	return 0;
2015-10-21 20:20:12 +03:00
}
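The control flow of the entry gate above (fail fast with -EAGAIN when the caller cannot wait, otherwise sleep until the queue is unfrozen or dying, and return -ENODEV in the latter case) can be mimicked in user space with a mutex and a condition variable. This is only a sketch under stated assumptions: the kernel uses a percpu_ref and a wait queue, not a mutex, and `struct toy_queue` and every `toy_queue_*` name below are hypothetical. Runtime power management is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical queue state, protected by @lock. */
struct toy_queue {
	pthread_mutex_t lock;
	pthread_cond_t freeze_wq;	/* signalled on unfreeze and on drain */
	int freeze_depth;		/* > 0 while the queue is frozen */
	bool dying;
	int usage;			/* number of users currently inside */
};

#define TOY_QUEUE_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false, 0 }

/* Enter the queue: 0 on success, -EAGAIN if @nowait and frozen, -ENODEV if dying. */
int toy_queue_enter(struct toy_queue *q, bool nowait)
{
	int ret = 0;

	pthread_mutex_lock(&q->lock);
	while (q->freeze_depth > 0) {
		if (nowait) {
			ret = -EAGAIN;
			goto out;
		}
		/* Sleep until the queue is unfrozen or marked dying. */
		while (q->freeze_depth > 0 && !q->dying)
			pthread_cond_wait(&q->freeze_wq, &q->lock);
		if (q->dying) {
			ret = -ENODEV;
			goto out;
		}
	}
	q->usage++;
out:
	pthread_mutex_unlock(&q->lock);
	return ret;
}

void toy_queue_exit(struct toy_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->usage--;
	pthread_mutex_unlock(&q->lock);
}

void toy_queue_freeze(struct toy_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->freeze_depth++;
	pthread_mutex_unlock(&q->lock);
}

void toy_queue_unfreeze(struct toy_queue *q)
{
	pthread_mutex_lock(&q->lock);
	if (--q->freeze_depth == 0)
		pthread_cond_broadcast(&q->freeze_wq);
	pthread_mutex_unlock(&q->lock);
}

/* Mark the queue dying and freeze it so that waiters re-check and bail out. */
void toy_queue_start_drain(struct toy_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->dying = true;
	q->freeze_depth++;
	pthread_cond_broadcast(&q->freeze_wq);
	pthread_mutex_unlock(&q->lock);
}
```

The mutex makes the ordering concern from the kernel comment disappear in this sketch; the real code needs the explicit read barrier precisely because it avoids taking a lock on this hot path.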
2021-11-04 21:45:51 +03:00
int __bio_queue_enter(struct request_queue *q, struct bio *bio)
2020-04-28 14:27:56 +03:00
{
2021-09-29 10:12:39 +03:00
	while (!blk_try_enter_queue(q, false)) {
2021-10-14 17:03:29 +03:00
		struct gendisk *disk = bio->bi_bdev->bd_disk;
2021-09-29 10:12:39 +03:00
		if (bio->bi_opf & REQ_NOWAIT) {
2021-09-29 10:12:40 +03:00
			if (test_bit(GD_DEAD, &disk->state))
2021-09-29 10:12:39 +03:00
				goto dead;
2020-04-28 14:27:56 +03:00
			bio_wouldblock_error(bio);
2022-09-12 19:53:25 +03:00
			return -EAGAIN;
2021-09-29 10:12:39 +03:00
		}
		/*
		 * This is the read side of the barrier pair in
		 * blk_freeze_queue_start(). We need to order reading the
		 * __PERCPU_REF_DEAD flag of .q_usage_counter and reading
		 * .mq_freeze_depth or the queue dying flag, otherwise the
		 * following wait may never return if the two reads are reordered.
		 */
		smp_rmb();
		wait_event(q->mq_freeze_wq,
			   (!q->mq_freeze_depth &&
			    blk_pm_resume_queue(false, q)) ||
2021-09-29 10:12:40 +03:00
			   test_bit(GD_DEAD, &disk->state));
		if (test_bit(GD_DEAD, &disk->state))
2021-09-29 10:12:39 +03:00
			goto dead;
2020-04-28 14:27:56 +03:00
	}
2021-09-29 10:12:39 +03:00
	return 0;
dead:
	bio_io_error(bio);
	return -ENODEV;
2020-04-28 14:27:56 +03:00
}
2015-10-21 20:20:12 +03:00
void blk_queue_exit(struct request_queue *q)
{
	percpu_ref_put(&q->q_usage_counter);
}
static void blk_queue_usage_counter_release(struct percpu_ref *ref)
{
	struct request_queue *q =
		container_of(ref, struct request_queue, q_usage_counter);
	wake_up_all(&q->mq_freeze_wq);
}
2017-08-29 01:03:41 +03:00
static void blk_rq_timed_out_timer(struct timer_list *t)
2015-10-30 15:57:30 +03:00
{
2017-08-29 01:03:41 +03:00
	struct request_queue *q = from_timer(q, t, timeout);
2015-10-30 15:57:30 +03:00
	kblockd_schedule_work(&q->timeout_work);
}
2019-01-30 16:21:45 +03:00
static void blk_timeout_work(struct work_struct *work)
{
}
2022-11-01 18:00:47 +03:00
struct request_queue *blk_alloc_queue(int node_id)
2005-06-23 11:08:19 +04:00
{
2007-07-24 11:28:11 +04:00
	struct request_queue *q;
2005-06-23 11:08:19 +04:00
2022-11-01 18:00:47 +03:00
	q = kmem_cache_alloc_node(blk_requestq_cachep, GFP_KERNEL | __GFP_ZERO,
				  node_id);
2005-04-17 02:20:36 +04:00
	if (!q)
		return NULL;
2018-05-31 20:11:36 +03:00
	q->last_merge = NULL;
2022-06-15 11:18:16 +03:00
	q->id = ida_alloc(&blk_queue_ida, GFP_KERNEL);
2011-12-14 03:33:37 +04:00
	if (q->id < 0)
2022-11-01 18:00:47 +03:00
		goto fail_q;
2011-12-14 03:33:37 +04:00
2017-03-22 02:20:01 +03:00
	q->stats = blk_alloc_queue_stats();
	if (!q->stats)
2022-07-27 19:22:57 +03:00
		goto fail_id;
2017-03-22 02:20:01 +03:00
2011-11-23 13:59:13 +04:00
	q->node = node_id;
2009-06-12 16:42:56 +04:00
2021-10-05 13:23:39 +03:00
	atomic_set(&q->nr_active_requests_shared_tags, 0);
2020-08-19 18:20:26 +03:00
2017-08-29 01:03:41 +03:00
	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
2019-01-30 16:21:45 +03:00
	INIT_WORK(&q->timeout_work, blk_timeout_work);
2011-12-14 03:33:41 +04:00
	INIT_LIST_HEAD(&q->icq_list);
2006-03-19 02:34:37 +03:00
2022-11-14 07:26:36 +03:00
	refcount_set(&q->refs, 1);
2020-06-19 23:47:30 +03:00
	mutex_init(&q->debugfs_mutex);
2006-03-19 02:34:37 +03:00
	mutex_init(&q->sysfs_lock);
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, q->sysfs_lock is required inside block's .show/.store
callbacks.
However, when the mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is required inside kobject_del() too; see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, sysfs write (store) is exclusive, so there is no
need to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock and
covers synchronous .store, the other is named .sysfs_dir_lock
and covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under kobjects. For switching the scheduler
via a store to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 14:01:48 +03:00
	mutex_init(&q->sysfs_dir_lock);
2018-11-15 22:17:28 +03:00
	spin_lock_init(&q->queue_lock);
2011-03-03 03:04:42 +03:00
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking requests on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
init_waitqueue_head ( & q - > mq_freeze_wq ) ;
2019-05-21 06:25:55 +03:00
mutex_init ( & q - > mq_freeze_lock ) ;
2013-10-24 12:20:05 +04:00
2015-10-21 20:20:12 +03:00
/*
* Init percpu_ref in atomic mode so that it ' s faster to shutdown .
* See blk_register_queue ( ) for details .
*/
if ( percpu_ref_init ( & q - > q_usage_counter ,
blk_queue_usage_counter_release ,
PERCPU_REF_INIT_ATOMIC , GFP_KERNEL ) )
2021-08-09 17:17:43 +03:00
goto fail_stats ;
2012-03-06 01:15:05 +04:00
2020-03-27 11:30:11 +03:00
blk_set_default_limits ( & q - > limits ) ;
2021-10-05 13:23:27 +03:00
q - > nr_requests = BLKDEV_DEFAULT_RQ ;
2020-03-27 11:30:11 +03:00
2005-04-17 02:20:36 +04:00
return q ;
2011-12-14 03:33:37 +04:00
2017-03-22 02:20:01 +03:00
fail_stats :
2021-08-09 17:17:43 +03:00
blk_free_queue_stats ( q - > stats ) ;
2011-12-14 03:33:37 +04:00
fail_id :
2022-06-15 11:18:16 +03:00
ida_free ( & blk_queue_ida , q - > id ) ;
2011-12-14 03:33:37 +04:00
fail_q :
2022-11-01 18:00:47 +03:00
kmem_cache_free ( blk_requestq_cachep , q ) ;
2011-12-14 03:33:37 +04:00
return NULL ;
2005-04-17 02:20:36 +04:00
}
2020-06-19 23:47:23 +03:00
/**
* blk_get_queue - increment the request_queue refcount
* @ q : the request_queue structure to increment the refcount for
*
* Increment the refcount of the request_queue kobject .
2020-06-19 23:47:24 +03:00
*
* Context : Any context .
2020-06-19 23:47:23 +03:00
*/
2011-12-14 03:33:38 +04:00
bool blk_get_queue ( struct request_queue * q )
2005-04-17 02:20:36 +04:00
{
2022-07-21 09:34:32 +03:00
if ( unlikely ( blk_queue_dying ( q ) ) )
return false ;
2022-11-14 07:26:36 +03:00
refcount_inc ( & q - > refs ) ;
2022-07-21 09:34:32 +03:00
return true ;
2005-04-17 02:20:36 +04:00
}
2011-05-27 09:44:43 +04:00
EXPORT_SYMBOL ( blk_get_queue ) ;
2005-04-17 02:20:36 +04:00
2006-12-08 13:39:46 +03:00
# ifdef CONFIG_FAIL_MAKE_REQUEST
static DECLARE_FAULT_ATTR ( fail_make_request ) ;
static int __init setup_fail_make_request ( char * str )
{
return setup_fault_attr ( & fail_make_request , str ) ;
}
__setup ( " fail_make_request= " , setup_fail_make_request ) ;
2021-11-17 09:13:58 +03:00
bool should_fail_request ( struct block_device * part , unsigned int bytes )
2006-12-08 13:39:46 +03:00
{
2020-11-24 11:36:54 +03:00
return part - > bd_make_it_fail & & should_fail ( & fail_make_request , bytes ) ;
2006-12-08 13:39:46 +03:00
}
static int __init fail_make_request_debugfs ( void )
{
2011-08-04 03:21:01 +04:00
struct dentry * dir = fault_create_debugfs_attr ( " fail_make_request " ,
NULL , & fail_make_request ) ;
2014-04-11 11:58:56 +04:00
return PTR_ERR_OR_ZERO ( dir ) ;
2006-12-08 13:39:46 +03:00
}
late_initcall ( fail_make_request_debugfs ) ;
# endif /* CONFIG_FAIL_MAKE_REQUEST */
2022-09-05 13:27:54 +03:00
static inline void bio_check_ro ( struct bio * bio )
2018-01-11 16:09:11 +03:00
{
2021-01-24 13:02:35 +03:00
if ( op_is_write ( bio_op ( bio ) ) & & bdev_read_only ( bio - > bi_bdev ) ) {
2018-09-06 01:14:36 +03:00
if ( op_is_flush ( bio - > bi_opf ) & & ! bio_sectors ( bio ) )
2022-09-05 13:27:54 +03:00
return ;
2022-03-04 21:00:56 +03:00
pr_warn ( " Trying to write to read-only block-device %pg \n " ,
bio - > bi_bdev ) ;
Partially revert "block: fail op_is_write() requests to read-only partitions"
It turns out that commit 721c7fc701c7 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.
The reason is that the lvm snapshotting will continue to write to the
snapshot COW volume, even after the volume has been marked read-only.
End result: snapshot failure.
This has actually been fixed in newer versions of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=200439
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442
The lvm2 fix is here
https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3
but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check. It now allows the
write to happen, but causes a warning, something like this:
generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
Workqueue: ksnaphd do_metadata
RIP: 0010:generic_make_request_checks+0x4ac/0x600
...
Call Trace:
generic_make_request+0x64/0x400
submit_bio+0x6c/0x140
dispatch_io+0x287/0x430
sync_io+0xc3/0x120
dm_io+0x1f8/0x220
do_metadata+0x1d/0x30
process_one_work+0x1b9/0x3e0
worker_thread+0x2b/0x3c0
kthread+0x113/0x130
ret_from_fork+0x35/0x40
Note that this is a "revert" in behavior only. I'm leaving alone the
actual code cleanups in commit 721c7fc701c7, but letting the previously
uncaught request go through with a warning instead of stopping it.
Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH <wgh@torlan.ru>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-03 22:22:09 +03:00
/* Older lvm-tools actually trigger this */
2018-01-11 16:09:11 +03:00
}
}
2018-02-07 01:05:39 +03:00
static noinline int should_fail_bio ( struct bio * bio )
{
2021-01-24 13:02:34 +03:00
if ( should_fail_request ( bdev_whole ( bio - > bi_bdev ) , bio - > bi_iter . bi_size ) )
2018-02-07 01:05:39 +03:00
return - EIO ;
return 0 ;
}
ALLOW_ERROR_INJECTION ( should_fail_bio , ERRNO ) ;
2018-03-14 18:56:53 +03:00
/*
* Check whether this bio extends beyond the end of the device or partition .
* This may well happen - the kernel calls bread ( ) without checking the size of
* the device , e . g . , when mounting a file system .
*/
2021-01-24 13:02:35 +03:00
static inline int bio_check_eod ( struct bio * bio )
2018-03-14 18:56:53 +03:00
{
2021-01-24 13:02:35 +03:00
sector_t maxsector = bdev_nr_sectors ( bio - > bi_bdev ) ;
2018-03-14 18:56:53 +03:00
unsigned int nr_sectors = bio_sectors ( bio ) ;
if ( nr_sectors & & maxsector & &
( nr_sectors > maxsector | |
bio - > bi_iter . bi_sector > maxsector - nr_sectors ) ) {
2022-03-04 21:00:57 +03:00
pr_info_ratelimited ( " %s: attempt to access beyond end of device \n "
2022-05-04 17:33:55 +03:00
" %pg: rw=%d, sector=%llu, nr_sectors = %u limit=%llu \n " ,
current - > comm , bio - > bi_bdev , bio - > bi_opf ,
bio - > bi_iter . bi_sector , nr_sectors , maxsector ) ;
2018-03-14 18:56:53 +03:00
return - EIO ;
}
return 0 ;
}
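The bounds test in bio_check_eod() compares the start sector against maxsector - nr_sectors instead of computing sector + nr_sectors, so the arithmetic cannot wrap. A minimal user-space sketch of that idea follows. The helper name range_beyond_eod is hypothetical, and uint64_t stands in for sector_t as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the check in bio_check_eod().
 * Returns true when [sector, sector + nr_sectors) runs past maxsector.
 * A maxsector of 0 is treated as "size unknown", so nothing is rejected.
 */
bool range_beyond_eod(uint64_t sector, uint32_t nr_sectors,
		      uint64_t maxsector)
{
	if (!nr_sectors || !maxsector)
		return false;
	/* Test nr_sectors first so the subtraction below cannot wrap. */
	if (nr_sectors > maxsector)
		return true;
	return sector > maxsector - nr_sectors;
}
```

A naive sector + nr_sectors > maxsector comparison would wrap around for a start sector near the top of the 64-bit range and let the request through.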
2017-08-23 20:10:32 +03:00
/*
* Remap block n of partition p to block n + start ( p ) of the disk .
*/
2021-01-24 13:02:35 +03:00
static int blk_partition_remap ( struct bio * bio )
2017-08-23 20:10:32 +03:00
{
2021-01-24 13:02:34 +03:00
struct block_device * p = bio - > bi_bdev ;
2017-08-23 20:10:32 +03:00
2018-03-14 18:56:53 +03:00
if ( unlikely ( should_fail_request ( p , bio - > bi_iter . bi_size ) ) )
2021-01-24 13:02:35 +03:00
return - EIO ;
2019-11-11 05:39:25 +03:00
if ( bio_sectors ( bio ) ) {
2020-11-24 11:36:54 +03:00
bio - > bi_iter . bi_sector + = p - > bd_start_sect ;
2020-12-03 19:21:38 +03:00
trace_block_bio_remap ( bio , p - > bd_dev ,
2020-11-24 11:34:24 +03:00
bio - > bi_iter . bi_sector -
2020-11-24 11:36:54 +03:00
p - > bd_start_sect ) ;
2018-03-14 18:56:53 +03:00
}
2021-01-24 13:02:36 +03:00
bio_set_flag ( bio , BIO_REMAPPED ) ;
2021-01-24 13:02:35 +03:00
return 0 ;
2017-08-23 20:10:32 +03:00
}
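The remap step above can be illustrated outside the kernel. This is only a sketch under simplifying assumptions: struct part, struct io and remap_to_disk are hypothetical stand-ins, not kernel types or functions, and fault injection and tracing are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for block_device and bio. */
struct part {
	uint64_t start_sect;	/* first sector of the partition on the disk */
};

struct io {
	uint64_t sector;	/* partition-relative on entry */
	uint32_t nr_sectors;
	int remapped;		/* plays the role of BIO_REMAPPED */
};

/*
 * Shift a partition-relative sector to a whole-disk sector.
 * Returns the original partition-relative sector, which is the value
 * the kernel passes to the remap trace event.
 */
uint64_t remap_to_disk(struct io *io, const struct part *p)
{
	uint64_t old;

	assert(io && p);
	old = io->sector;
	/* Zero-length I/O (e.g. an empty flush) carries no sector to shift. */
	if (io->nr_sectors)
		io->sector += p->start_sect;
	io->remapped = 1;
	return old;
}
```

Setting the flag even for zero-length I/O matches the function above, where bio_set_flag() runs unconditionally so the bio is not remapped a second time on resubmission.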
2020-05-12 11:55:47 +03:00
/*
* Check write append to a zoned block device .
*/
static inline blk_status_t blk_check_zone_append ( struct request_queue * q ,
struct bio * bio )
{
int nr_sectors = bio_sectors ( bio ) ;
/* Only applicable to zoned block devices */
2022-07-06 10:03:37 +03:00
if ( ! bdev_is_zoned ( bio - > bi_bdev ) )
2020-05-12 11:55:47 +03:00
return BLK_STS_NOTSUPP ;
/* The bio sector must point to the start of a sequential zone */
2022-07-06 10:03:39 +03:00
if ( bio - > bi_iter . bi_sector & ( bdev_zone_sectors ( bio - > bi_bdev ) - 1 ) | |
! bio_zone_is_seq ( bio ) )
2020-05-12 11:55:47 +03:00
return BLK_STS_IOERR ;
/*
* Not allowed to cross zone boundaries . Otherwise , the BIO will be
* split and could result in non - contiguous sectors being written in
* different zones .
*/
if ( nr_sectors > q - > limits . chunk_sectors )
return BLK_STS_IOERR ;
/* Make sure the BIO is small enough and will not get split */
if ( nr_sectors > q - > limits . max_zone_append_sectors )
return BLK_STS_IOERR ;
bio - > bi_opf | = REQ_NOMERGE ;
return BLK_STS_OK ;
}
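The zone-start test in blk_check_zone_append() uses a mask, sector & (zone_sectors - 1), rather than a modulo. That form is only valid when the zone size is a power of two. The sketch below makes the assumption explicit; sector_is_zone_start is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when sector is the first sector of a zone.
 * The mask trick only works when zone_sectors is a power of two, which
 * this sketch assumes and asserts.
 */
bool sector_is_zone_start(uint64_t sector, uint64_t zone_sectors)
{
	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);
	return (sector & (zone_sectors - 1)) == 0;
}
```

With 256-sector zones, sectors 0 and 512 pass the check while 255 and 513 do not.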
2021-11-03 14:47:09 +03:00
static void __submit_bio ( struct bio * bio )
{
struct gendisk * disk = bio - > bi_bdev - > bd_disk ;
2021-09-29 10:12:37 +03:00
2022-02-16 07:45:08 +03:00
if ( unlikely ( ! blk_crypto_bio_prep ( & bio ) ) )
return ;
if ( ! disk - > fops - > submit_bio ) {
2021-10-12 14:12:24 +03:00
blk_mq_submit_bio ( bio ) ;
2022-02-16 07:45:08 +03:00
} else if ( likely ( bio_queue_enter ( bio ) = = 0 ) ) {
disk - > fops - > submit_bio ( bio ) ;
blk_queue_exit ( disk - > queue ) ;
}
2020-05-16 21:28:01 +03:00
}
2020-07-01 11:59:45 +03:00
/*
* The loop in this function may be a bit non - obvious , and so deserves some
* explanation :
*
* - Before entering the loop , bio - > bi_next is NULL ( as all callers ensure
* that ) , so we have a list with a single bio .
* - We pretend that we have just taken it off a longer list , so we assign
* bio_list to a pointer to the bio_list_on_stack , thus initialising the
* bio_list of new bios to be added . - > submit_bio ( ) may indeed add some more
* bios through a recursive call to submit_bio_noacct . If it did , we find a
* non - NULL value in bio_list and re - enter the loop from the top .
* - In this case we really did just take the bio off the top of the list ( no
* pretending ) and so remove it from bio_list , and call into - > submit_bio ( )
* again .
*
* bio_list_on_stack [ 0 ] contains bios submitted by the current - > submit_bio .
* bio_list_on_stack [ 1 ] contains bios that were submitted before the current
2022-03-05 05:08:03 +03:00
* - > submit_bio , but that haven ' t been processed yet .
2020-07-01 11:59:45 +03:00
*/
2021-10-12 14:12:24 +03:00
static void __submit_bio_noacct ( struct bio * bio )
2020-07-01 11:59:45 +03:00
{
struct bio_list bio_list_on_stack [ 2 ] ;
BUG_ON ( bio - > bi_next ) ;
bio_list_init ( & bio_list_on_stack [ 0 ] ) ;
current - > bio_list = bio_list_on_stack ;
do {
2021-10-14 17:03:29 +03:00
struct request_queue * q = bdev_get_queue ( bio - > bi_bdev ) ;
2020-07-01 11:59:45 +03:00
struct bio_list lower , same ;
/*
* Create a fresh bio_list for all subordinate requests .
*/
bio_list_on_stack [ 1 ] = bio_list_on_stack [ 0 ] ;
bio_list_init ( & bio_list_on_stack [ 0 ] ) ;
2021-10-12 14:12:24 +03:00
__submit_bio ( bio ) ;
2020-07-01 11:59:45 +03:00
/*
* Sort new bios into those for a lower level and those for the
* same level .
*/
bio_list_init ( & lower ) ;
bio_list_init ( & same ) ;
while ( ( bio = bio_list_pop ( & bio_list_on_stack [ 0 ] ) ) ! = NULL )
2021-10-14 17:03:29 +03:00
if ( q = = bdev_get_queue ( bio - > bi_bdev ) )
2020-07-01 11:59:45 +03:00
bio_list_add ( & same , bio ) ;
else
bio_list_add ( & lower , bio ) ;
/*
* Now assemble so we handle the lowest level first .
*/
bio_list_merge ( & bio_list_on_stack [ 0 ] , & lower ) ;
bio_list_merge ( & bio_list_on_stack [ 0 ] , & same ) ;
bio_list_merge ( & bio_list_on_stack [ 0 ] , & bio_list_on_stack [ 1 ] ) ;
} while ( ( bio = bio_list_pop ( & bio_list_on_stack [ 0 ] ) ) ) ;
current - > bio_list = NULL ;
}
2021-10-12 14:12:24 +03:00
static void __submit_bio_noacct_mq ( struct bio * bio )
2020-07-01 11:59:46 +03:00
{
2020-07-02 22:21:25 +03:00
struct bio_list bio_list [ 2 ] = { } ;
2020-07-01 11:59:46 +03:00
2020-07-02 22:21:25 +03:00
current - > bio_list = bio_list ;
2020-07-01 11:59:46 +03:00
do {
2021-10-12 14:12:24 +03:00
__submit_bio ( bio ) ;
2020-07-02 22:21:25 +03:00
} while ( ( bio = bio_list_pop ( & bio_list [ 0 ] ) ) ) ;
2020-07-01 11:59:46 +03:00
current - > bio_list = NULL ;
}
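The single-list loop in __submit_bio_noacct_mq() turns nested submissions into iteration: a submission made while another is active is appended to a per-thread list and handled after the active call returns. The following user-space sketch shows only that mechanism. All names (struct item, submit, handle, active_list) are hypothetical, a global pointer stands in for current->bio_list, and the lower/same level sorting done by __submit_bio_noacct() is deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: a work item that may spawn one child. */
struct item {
	struct item *next;	/* links pending items, like bio->bi_next */
	struct item *child;	/* submitted when this item is handled */
	int handled;
};

struct item_list {
	struct item *head, *tail;
};

/* Plays the role of current->bio_list: non-NULL while a drain is active. */
static struct item_list *active_list;
static int handle_depth, handle_max_depth;

static void item_list_add(struct item_list *l, struct item *it)
{
	it->next = NULL;
	if (l->tail)
		l->tail->next = it;
	else
		l->head = it;
	l->tail = it;
}

static struct item *item_list_pop(struct item_list *l)
{
	struct item *it = l->head;

	if (it) {
		l->head = it->next;
		if (!l->head)
			l->tail = NULL;
		it->next = NULL;
	}
	return it;
}

void submit(struct item *it);

/* The "driver": handling an item resubmits its child, if any. */
static void handle(struct item *it)
{
	if (++handle_depth > handle_max_depth)
		handle_max_depth = handle_depth;
	it->handled = 1;
	if (it->child)
		submit(it->child);
	handle_depth--;
}

void submit(struct item *it)
{
	struct item_list pending = { NULL, NULL };

	if (active_list) {
		/* Nested call: defer instead of recursing into handle(). */
		item_list_add(active_list, it);
		return;
	}
	active_list = &pending;
	do {
		handle(it);
	} while ((it = item_list_pop(&pending)));
	active_list = NULL;
}
```

For a chain of three items where each one submits the next, handle() never nests deeper than one level, which is the stack-usage property the kernel loop is after.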
2022-02-16 07:45:10 +03:00
void submit_bio_noacct_nocheck ( struct bio * bio )
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 11:53:42 +04:00
{
2011-09-15 16:01:40 +04:00
/*
2020-07-01 11:59:45 +03:00
* We only want one - > submit_bio to be active at a time , else stack
* usage with stacked devices could be a problem . Use current - > bio_list
* to collect a list of requests submitted by a - > submit_bio method while
* it is active , and then process them after it returned .
2011-09-15 16:01:40 +04:00
*/
2021-10-12 14:12:24 +03:00
if ( current - > bio_list )
2017-03-10 09:00:47 +03:00
bio_list_add ( & current - > bio_list [ 0 ] , bio ) ;
2021-10-12 14:12:24 +03:00
else if ( ! bio - > bi_bdev - > bd_disk - > fops - > submit_bio )
__submit_bio_noacct_mq ( bio ) ;
else
__submit_bio_noacct ( bio ) ;
2007-05-01 11:53:42 +04:00
}
2022-02-16 07:45:10 +03:00
/**
* submit_bio_noacct - re - submit a bio to the block device layer for I / O
* @ bio : The bio describing the location in memory and on the device .
*
* This is a version of submit_bio ( ) that shall only be used for I / O that is
* resubmitted to lower level drivers by stacking block drivers . All file
* systems and other upper level users of the block layer should use
* submit_bio ( ) instead .
*/
void submit_bio_noacct ( struct bio * bio )
2005-04-17 02:20:36 +04:00
{
2021-01-24 13:02:34 +03:00
struct block_device * bdev = bio - > bi_bdev ;
2021-10-14 17:03:29 +03:00
struct request_queue * q = bdev_get_queue ( bdev ) ;
2017-06-03 10:38:06 +03:00
blk_status_t status = BLK_STS_IOERR ;
2020-06-04 20:23:39 +03:00
struct blk_plug * plug ;
2005-04-17 02:20:36 +04:00
might_sleep ( ) ;
2022-07-06 10:03:38 +03:00
plug = blk_mq_plug ( bio ) ;
2020-06-04 20:23:39 +03:00
if ( plug & & plug - > nowait )
bio - > bi_opf | = REQ_NOWAIT ;
2017-06-20 15:05:46 +03:00
/*
2020-05-28 22:19:29 +03:00
* For a REQ_NOWAIT based request , return - EOPNOTSUPP
2020-09-23 23:06:51 +03:00
* if queue does not support NOWAIT .
2017-06-20 15:05:46 +03:00
*/
2022-09-27 10:58:15 +03:00
if ( ( bio - > bi_opf & REQ_NOWAIT ) & & ! bdev_nowait ( bdev ) )
2020-05-28 22:19:29 +03:00
goto not_supported ;
2017-06-20 15:05:46 +03:00
2018-02-07 01:05:39 +03:00
if ( should_fail_bio ( bio ) )
2011-09-12 14:12:01 +04:00
goto end_io ;
2022-09-05 13:27:54 +03:00
bio_check_ro ( bio ) ;
2021-01-25 21:39:57 +03:00
if ( ! bio_flagged ( bio , BIO_REMAPPED ) ) {
if ( unlikely ( bio_check_eod ( bio ) ) )
goto end_io ;
if ( bdev - > bd_partno & & unlikely ( blk_partition_remap ( bio ) ) )
goto end_io ;
}
2006-03-23 22:00:26 +03:00
2011-09-12 14:12:01 +04:00
/*
2020-07-01 11:59:44 +03:00
* Filter flush bio ' s early so that bio based drivers without flush
* support don ' t have to worry about them .
2011-09-12 14:12:01 +04:00
*/
2017-01-27 19:08:23 +03:00
if ( op_is_flush ( bio - > bi_opf ) & &
2016-04-13 22:33:19 +03:00
! test_bit ( QUEUE_FLAG_WC , & q - > queue_flags ) ) {
2016-08-06 00:35:16 +03:00
bio - > bi_opf & = ~ ( REQ_PREFLUSH | REQ_FUA ) ;
2020-07-01 11:59:42 +03:00
if ( ! bio_sectors ( bio ) ) {
2017-06-03 10:38:06 +03:00
status = BLK_STS_OK ;
2007-11-02 10:49:08 +03:00
goto end_io ;
}
2011-09-12 14:12:01 +04:00
}
2006-10-31 09:07:21 +03:00
2018-12-14 19:21:22 +03:00
if ( ! test_bit ( QUEUE_FLAG_POLL , & q - > queue_flags ) )
2021-10-12 14:12:21 +03:00
bio_clear_polled ( bio ) ;
2018-12-14 19:21:22 +03:00
2016-06-09 17:00:36 +03:00
switch ( bio_op ( bio ) ) {
case REQ_OP_DISCARD :
2022-04-15 07:52:55 +03:00
if ( ! bdev_max_discard_sectors ( bdev ) )
2016-06-09 17:00:36 +03:00
goto not_supported ;
break ;
case REQ_OP_SECURE_ERASE :
2022-04-15 07:52:57 +03:00
if ( ! bdev_max_secure_erase_sectors ( bdev ) )
2016-06-09 17:00:36 +03:00
goto not_supported ;
break ;
2020-05-12 11:55:47 +03:00
case REQ_OP_ZONE_APPEND :
status = blk_check_zone_append ( q , bio ) ;
if ( status ! = BLK_STS_OK )
goto end_io ;
break ;
2016-10-18 09:40:32 +03:00
case REQ_OP_ZONE_RESET :
2019-10-27 17:05:45 +03:00
case REQ_OP_ZONE_OPEN :
case REQ_OP_ZONE_CLOSE :
case REQ_OP_ZONE_FINISH :
2022-07-06 10:03:37 +03:00
if ( ! bdev_is_zoned ( bio - > bi_bdev ) )
2016-10-18 09:40:32 +03:00
goto not_supported ;
2016-06-09 17:00:36 +03:00
break ;
2019-08-01 20:26:36 +03:00
case REQ_OP_ZONE_RESET_ALL :
2022-07-06 10:03:37 +03:00
if ( ! bdev_is_zoned ( bio - > bi_bdev ) | | ! blk_queue_zone_resetall ( q ) )
2019-08-01 20:26:36 +03:00
goto not_supported ;
break ;
2016-11-30 23:28:59 +03:00
case REQ_OP_WRITE_ZEROES :
2017-08-23 20:10:32 +03:00
if ( ! q - > limits . max_write_zeroes_sectors )
2016-11-30 23:28:59 +03:00
goto not_supported ;
break ;
2016-06-09 17:00:36 +03:00
default :
break ;
2011-09-12 14:12:01 +04:00
}
2009-09-08 23:56:38 +04:00
2021-11-12 12:33:54 +03:00
if ( blk_throtl_bio ( bio ) )
2022-02-16 07:45:10 +03:00
return ;
2020-06-27 10:31:58 +03:00
blk_cgroup_bio_start ( bio ) ;
blkcg_bio_issue_init ( bio ) ;
2011-09-15 16:01:40 +04:00
block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 18:40:52 +03:00
if ( ! bio_flagged ( bio , BIO_TRACE_COMPLETION ) ) {
2020-12-03 19:21:36 +03:00
trace_block_bio_queue ( bio ) ;
2017-04-07 18:40:52 +03:00
/* Now that enqueuing has been traced, we need to trace
* completion as well .
*/
bio_set_flag ( bio , BIO_TRACE_COMPLETION ) ;
}
2022-02-16 07:45:10 +03:00
submit_bio_noacct_nocheck ( bio ) ;
2022-02-16 07:45:11 +03:00
return ;
2008-11-28 07:32:03 +03:00
2016-06-09 17:00:36 +03:00
not_supported :
2017-06-03 10:38:06 +03:00
status = BLK_STS_NOTSUPP ;
2008-11-28 07:32:03 +03:00
end_io :
2017-06-03 10:38:06 +03:00
bio - > bi_status = status ;
2015-07-20 16:29:37 +03:00
bio_endio ( bio ) ;
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
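The deferral scheme described in this message, where a recursive submission is turned into iteration over a per-thread list, can be sketched in userspace as follows. Everything here is an assumption made for illustration: struct toy_bio, toy_submit() and toy_make_request() are hypothetical names, a global list stands in for the per-thread list, and the "stack" of devices is reduced to a level counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; these names are not kernel APIs. */
struct toy_bio {
	int level;                 /* 0 is the bottom device of the stack */
	struct toy_bio *next;      /* models bi_next, links pending bios */
};

static struct toy_bio *pending_head;               /* per-thread in the kernel */
static struct toy_bio **pending_tail = &pending_head;
static int active;             /* is a submit call already running? */
static int depth, max_depth;   /* nesting of the handler, for the demo */
static int handled;            /* how many times a handler ran */

static void toy_submit(struct toy_bio *bio);

/* Models a stacking driver: pass the bio down one level until level 0. */
static void toy_make_request(struct toy_bio *bio)
{
	depth++;
	if (depth > max_depth)
		max_depth = depth;
	handled++;
	if (bio->level > 0) {
		bio->level--;
		toy_submit(bio);   /* "recursive" call: only queues the bio */
	}
	depth--;
}

static void toy_submit(struct toy_bio *bio)
{
	assert(bio->next == NULL);     /* callers must leave the link clear */
	if (active) {
		/* Already inside a submit: append to the tail to keep order. */
		*pending_tail = bio;
		pending_tail = &bio->next;
		return;
	}
	active = 1;
	do {
		toy_make_request(bio);
		bio = pending_head;
		if (bio) {
			pending_head = bio->next;
			if (!pending_head)
				pending_tail = &pending_head;
			bio->next = NULL;  /* clear the link before the real call */
		}
	} while (bio);
	active = 0;
}

/* Push one bio through 'levels' stacked devices; return the peak nesting. */
static int toy_demo(int levels)
{
	struct toy_bio bio = { .level = levels, .next = NULL };

	handled = 0;
	max_depth = 0;
	toy_submit(&bio);
	return max_depth;
}
```

With three stacked levels the handler runs four times, yet toy_demo(3) returns 1: the handler never nests, so stack use no longer grows with the depth of the device stack.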
2007-05-01 11:53:42 +04:00
}
2020-07-01 11:59:44 +03:00
EXPORT_SYMBOL ( submit_bio_noacct ) ;
2005-04-17 02:20:36 +04:00
/**
2008-08-19 22:13:11 +04:00
* submit_bio - submit a bio to the block device layer for I / O
2005-04-17 02:20:36 +04:00
* @ bio : The & struct bio which describes the I / O
*
2020-04-28 14:27:53 +03:00
* submit_bio ( ) is used to submit I / O requests to block devices . It is passed a
* fully set up & struct bio that describes the I / O that needs to be done . The
2021-01-24 13:02:34 +03:00
* bio will be sent to the device described by the bi_bdev field .
2005-04-17 02:20:36 +04:00
*
2020-04-28 14:27:53 +03:00
* The success / failure status of the request , along with notification of
* completion , is delivered asynchronously through the - > bi_end_io ( ) callback
2022-09-14 10:42:37 +03:00
* in @ bio . The bio must NOT be touched by the caller until - > bi_end_io ( ) has
2020-04-28 14:27:53 +03:00
* been called .
2005-04-17 02:20:36 +04:00
*/
2021-10-12 14:12:24 +03:00
void submit_bio ( struct bio * bio )
2005-04-17 02:20:36 +04:00
{
2019-06-27 23:39:52 +03:00
if ( blkcg_punt_bio_submit ( bio ) )
2021-10-12 14:12:24 +03:00
return ;
2019-06-27 23:39:52 +03:00
2022-05-16 09:36:54 +03:00
if ( bio_op ( bio ) = = REQ_OP_READ ) {
task_io_account_read ( bio - > bi_iter . bi_size ) ;
count_vm_events ( PGPGIN , bio_sectors ( bio ) ) ;
} else if ( bio_op ( bio ) = = REQ_OP_WRITE ) {
count_vm_events ( PGPGOUT , bio_sectors ( bio ) ) ;
2005-04-17 02:20:36 +04:00
}
2021-10-12 14:12:24 +03:00
submit_bio_noacct ( bio ) ;
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL ( submit_bio ) ;
2021-10-12 14:12:24 +03:00
/**
* bio_poll - poll for BIO completions
* @ bio : bio to poll for
2021-11-25 19:20:55 +03:00
* @ iob : batches of IO
2021-10-12 14:12:24 +03:00
* @ flags : BLK_POLL_ * flags that control the behavior
*
* Poll for completions on queue associated with the bio . Returns number of
* completed entries found .
*
* Note : the caller must either be the context that submitted @ bio , or
* be in an RCU critical section to prevent freeing of @ bio .
*/
2021-10-12 18:24:29 +03:00
int bio_poll ( struct bio * bio , struct io_comp_batch * iob , unsigned int flags )
2021-10-12 14:12:24 +03:00
{
2021-10-20 00:24:11 +03:00
struct request_queue * q = bdev_get_queue ( bio - > bi_bdev ) ;
2021-10-12 14:12:24 +03:00
blk_qc_t cookie = READ_ONCE ( bio - > bi_cookie ) ;
2022-03-05 05:08:03 +03:00
int ret = 0 ;
2021-10-12 14:12:24 +03:00
if ( cookie = = BLK_QC_T_NONE | |
! test_bit ( QUEUE_FLAG_POLL , & q - > queue_flags ) )
return 0 ;
2022-09-29 17:41:41 +03:00
/*
* As the requests that require a zone lock are not plugged in the
* first place , directly accessing the plug instead of using
* blk_mq_plug ( ) should not have any consequences during flushing for
* zoned devices .
*/
2022-01-27 10:05:49 +03:00
blk_flush_plug ( current - > plug , false ) ;
2021-10-12 14:12:24 +03:00
2022-05-23 15:43:02 +03:00
if ( bio_queue_enter ( bio ) )
2021-10-12 14:12:24 +03:00
return 0 ;
2022-03-05 05:08:03 +03:00
if ( queue_is_mq ( q ) ) {
2021-10-12 18:24:29 +03:00
ret = blk_mq_poll ( q , cookie , iob , flags ) ;
2022-03-05 05:08:03 +03:00
} else {
struct gendisk * disk = q - > disk ;
if ( disk & & disk - > fops - > poll_bio )
ret = disk - > fops - > poll_bio ( bio , iob , flags ) ;
}
2021-10-12 14:12:24 +03:00
blk_queue_exit ( q ) ;
return ret ;
}
EXPORT_SYMBOL_GPL ( bio_poll ) ;
/*
* Helper to implement file_operations . iopoll . Requires the bio to be stored
* in iocb - > private , and cleared before freeing the bio .
*/
2021-10-12 18:24:29 +03:00
int iocb_bio_iopoll ( struct kiocb * kiocb , struct io_comp_batch * iob ,
unsigned int flags )
2021-10-12 14:12:24 +03:00
{
struct bio * bio ;
int ret = 0 ;
/*
* Note : the bio cache only uses SLAB_TYPESAFE_BY_RCU , so bio can
* point to a freshly allocated bio at this point . If that happens
* we have a few cases to consider :
*
* 1 ) the bio is being initialized and bi_bdev is NULL . We can simply
* do nothing in this case
* 2 ) the bio points to a not poll enabled device . bio_poll will catch
* this and return 0
* 3 ) the bio points to a poll capable device , including but not
* limited to the one that the original bio pointed to . In this
* case we will call into the actual poll method and poll for I / O ,
* even if we don ' t need to , but it won ' t cause harm either .
*
* For cases 2 ) and 3 ) above the RCU grace period ensures that bi_bdev
* is still allocated . Because partitions hold a reference to the whole
* device bdev and thus disk , the disk is also still valid . Grabbing
* a reference to the queue in bio_poll ( ) ensures the hctxs and requests
* are still valid as well .
*/
rcu_read_lock ( ) ;
bio = READ_ONCE ( kiocb - > private ) ;
if ( bio & & bio - > bi_bdev )
2021-10-12 18:24:29 +03:00
ret = bio_poll ( bio , iob , flags ) ;
2021-10-12 14:12:24 +03:00
rcu_read_unlock ( ) ;
return ret ;
}
EXPORT_SYMBOL_GPL ( iocb_bio_iopoll ) ;
2021-11-17 09:14:01 +03:00
void update_io_ticks ( struct block_device * part , unsigned long now , bool end )
2020-05-27 08:24:13 +03:00
{
unsigned long stamp ;
again :
2020-11-24 11:36:54 +03:00
stamp = READ_ONCE ( part - > bd_stamp ) ;
2021-07-06 00:47:26 +03:00
if ( unlikely ( time_after ( now , stamp ) ) ) {
2022-07-12 18:27:41 +03:00
if ( likely ( try_cmpxchg ( & part - > bd_stamp , & stamp , now ) ) )
2020-05-27 08:24:13 +03:00
__part_stat_add ( part , io_ticks , end ? now - stamp : 1 ) ;
}
2020-11-24 11:36:54 +03:00
if ( part - > bd_partno ) {
part = bdev_whole ( part ) ;
2020-05-27 08:24:13 +03:00
goto again ;
}
}
2022-04-18 05:27:13 +03:00
unsigned long bdev_start_io_acct ( struct block_device * bdev ,
2022-07-14 21:06:28 +03:00
unsigned int sectors , enum req_op op ,
2022-04-18 05:27:13 +03:00
unsigned long start_time )
2020-05-27 08:24:04 +03:00
{
const int sgrp = op_stat_group ( op ) ;
part_stat_lock ( ) ;
2022-04-18 05:27:13 +03:00
update_io_ticks ( bdev , start_time , false ) ;
part_stat_inc ( bdev , ios [ sgrp ] ) ;
part_stat_add ( bdev , sectors [ sgrp ] , sectors ) ;
part_stat_local_inc ( bdev , in_flight [ op_is_write ( op ) ] ) ;
2020-05-27 08:24:04 +03:00
part_stat_unlock ( ) ;
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
2022-01-28 18:58:39 +03:00
return start_time ;
}
2022-04-18 05:27:13 +03:00
EXPORT_SYMBOL ( bdev_start_io_acct ) ;
2022-01-28 18:58:39 +03:00
2021-01-24 13:02:37 +03:00
/**
* bio_start_io_acct - start I / O accounting for bio based drivers
* @ bio : bio to start account for
*
* Returns the start time that should be passed back to bio_end_io_acct ( ) .
*/
unsigned long bio_start_io_acct ( struct bio * bio )
2020-09-01 01:27:23 +03:00
{
2022-04-18 05:27:13 +03:00
return bdev_start_io_acct ( bio - > bi_bdev , bio_sectors ( bio ) ,
bio_op ( bio ) , jiffies ) ;
2020-09-01 01:27:23 +03:00
}
2021-01-24 13:02:37 +03:00
EXPORT_SYMBOL_GPL ( bio_start_io_acct ) ;
2020-09-01 01:27:23 +03:00
2022-07-14 21:06:28 +03:00
void bdev_end_io_acct ( struct block_device * bdev , enum req_op op ,
2022-04-18 05:27:13 +03:00
unsigned long start_time )
2020-05-27 08:24:04 +03:00
{
const int sgrp = op_stat_group ( op ) ;
unsigned long now = READ_ONCE ( jiffies ) ;
unsigned long duration = now - start_time ;
2018-12-06 19:41:19 +03:00
2020-05-27 08:24:04 +03:00
part_stat_lock ( ) ;
2022-04-18 05:27:13 +03:00
update_io_ticks ( bdev , now , true ) ;
part_stat_add ( bdev , nsecs [ sgrp ] , jiffies_to_nsecs ( duration ) ) ;
part_stat_local_dec ( bdev , in_flight [ op_is_write ( op ) ] ) ;
2013-10-24 12:20:05 +04:00
part_stat_unlock ( ) ;
}
2022-04-18 05:27:13 +03:00
EXPORT_SYMBOL ( bdev_end_io_acct ) ;
2020-09-01 01:27:23 +03:00
2021-01-24 13:02:37 +03:00
void bio_end_io_acct_remapped ( struct bio * bio , unsigned long start_time ,
2022-04-18 05:27:13 +03:00
struct block_device * orig_bdev )
2020-09-01 01:27:23 +03:00
{
2022-04-18 05:27:13 +03:00
bdev_end_io_acct ( orig_bdev , bio_op ( bio ) , start_time ) ;
2020-09-01 01:27:23 +03:00
}
2021-01-24 13:02:37 +03:00
EXPORT_SYMBOL_GPL ( bio_end_io_acct_remapped ) ;
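The start/end pairing of the accounting helpers above, where the value returned at start must be handed back at completion, can be shown with a simplified userspace model. This is a sketch only: struct toy_stats, toy_start_io_acct(), toy_end_io_acct() and the fake clock toy_clock are hypothetical, and the per-operation stat groups and per-partition locking of the real code are left out.

```c
#include <assert.h>

/* Hypothetical userspace model; names and fields are illustrative only. */
struct toy_stats {
	unsigned long ios;         /* number of I/Os started */
	unsigned long sectors;     /* sectors submitted */
	unsigned long ticks;       /* summed durations, in fake clock ticks */
	int in_flight;             /* started but not yet completed */
};

static unsigned long toy_clock;    /* stands in for jiffies */

/* Record the start of an I/O and return the timestamp for the caller. */
static unsigned long toy_start_io_acct(struct toy_stats *st,
				       unsigned int sectors)
{
	st->ios++;
	st->sectors += sectors;
	st->in_flight++;
	return toy_clock;          /* must be passed back at completion */
}

/* Record the completion of an I/O that began at 'start_time'. */
static void toy_end_io_acct(struct toy_stats *st, unsigned long start_time)
{
	assert(st->in_flight > 0);
	st->ticks += toy_clock - start_time;
	st->in_flight--;
}

/* Two overlapping I/Os, each carrying its own start time. */
static struct toy_stats toy_demo(void)
{
	struct toy_stats st = { 0 };
	unsigned long a, b;

	toy_clock = 100;
	a = toy_start_io_acct(&st, 8);
	toy_clock = 103;
	b = toy_start_io_acct(&st, 16);
	toy_clock = 110;
	toy_end_io_acct(&st, a);   /* lasted 10 ticks */
	toy_clock = 112;
	toy_end_io_acct(&st, b);   /* lasted 9 ticks */
	return st;
}
```

Because each I/O keeps its own start time, overlapping requests are timed independently; in this run the summed duration is 19 ticks and in_flight returns to zero.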
2020-09-01 01:27:23 +03:00
2008-10-01 18:12:15 +04:00
/**
* blk_lld_busy - Check if underlying low - level drivers of a device are busy
* @ q : the queue of the device being checked
*
* Description :
* Check if underlying low - level drivers of a device are busy .
* If the drivers want to export their busy state , they must set their own
* exporting function using blk_queue_lld_busy ( ) first .
*
* Basically , this function is used only by request stacking drivers
* to stop dispatching requests to underlying devices when underlying
* devices are busy . This behavior helps more I / O merging on the queue
* of the request stacking driver and prevents I / O throughput regression
* on burst I / O load .
*
* Return :
* 0 - Not busy ( The request stacking driver should dispatch request )
* 1 - Busy ( The request stacking driver should stop dispatching request )
*/
int blk_lld_busy ( struct request_queue * q )
{
2018-11-15 22:22:51 +03:00
if ( queue_is_mq ( q ) & & q - > mq_ops - > busy )
2018-10-29 19:15:10 +03:00
return q - > mq_ops - > busy ( q ) ;
2008-10-01 18:12:15 +04:00
return 0 ;
}
EXPORT_SYMBOL_GPL ( blk_lld_busy ) ;
2014-04-08 19:15:35 +04:00
int kblockd_schedule_work ( struct work_struct * work )
2005-04-17 02:20:36 +04:00
{
return queue_work ( kblockd_workqueue , work ) ;
}
EXPORT_SYMBOL ( kblockd_schedule_work ) ;
2017-04-10 18:54:55 +03:00
int kblockd_mod_delayed_work_on ( int cpu , struct delayed_work * dwork ,
unsigned long delay )
{
return mod_delayed_work_on ( cpu , kblockd_workqueue , dwork , delay ) ;
}
EXPORT_SYMBOL ( kblockd_mod_delayed_work_on ) ;
2021-10-06 15:34:11 +03:00
void blk_start_plug_nr_ios ( struct blk_plug * plug , unsigned short nr_ios )
{
struct task_struct * tsk = current ;
/*
* If this is a nested plug , don ' t actually assign it .
*/
if ( tsk - > plug )
return ;
2021-10-18 19:12:12 +03:00
plug - > mq_list = NULL ;
2021-10-06 15:34:11 +03:00
plug - > cached_rq = NULL ;
plug - > nr_ios = min_t ( unsigned short , nr_ios , BLK_MAX_REQUEST_COUNT ) ;
plug - > rq_count = 0 ;
plug - > multiple_queues = false ;
2021-10-19 15:02:30 +03:00
plug - > has_elevator = false ;
2021-10-06 15:34:11 +03:00
plug - > nowait = false ;
INIT_LIST_HEAD ( & plug - > cb_list ) ;
/*
* Store ordering should not be needed here , since a potential
* preempt will imply a full memory barrier
*/
tsk - > plug = plug ;
}
2011-09-21 12:00:16 +04:00
/**
* blk_start_plug - initialize blk_plug and track it inside the task_struct
* @ plug : The & struct blk_plug that needs to be initialized
*
* Description :
2019-01-09 00:57:34 +03:00
* blk_start_plug ( ) indicates to the block layer an intent by the caller
* to submit multiple I / O requests in a batch . The block layer may use
* this hint to defer submitting I / Os from the caller until blk_finish_plug ( )
* is called . However , the block layer may choose to submit requests
* before a call to blk_finish_plug ( ) if the number of queued I / Os
* exceeds % BLK_MAX_REQUEST_COUNT , or if the size of the I / O is larger than
* % BLK_PLUG_FLUSH_SIZE . The queued I / Os may also be submitted early if
* the task schedules ( see below ) .
*
2011-09-21 12:00:16 +04:00
* Tracking blk_plug inside the task_struct will help with auto - flushing the
* pending I / O should the task end up blocking between blk_start_plug ( ) and
* blk_finish_plug ( ) . This is important from a performance perspective , but
* also ensures that we don ' t deadlock . For instance , if the task is blocking
* for a memory allocation , memory reclaim could end up wanting to free a
* page belonging to that request that is currently residing in our private
* plug . By flushing the pending I / O when the process goes to sleep , we avoid
* this kind of deadlock .
*/
2011-03-08 15:19:51 +03:00
void blk_start_plug ( struct blk_plug * plug )
{
2021-10-06 15:34:11 +03:00
blk_start_plug_nr_ios ( plug , 1 ) ;
2011-03-08 15:19:51 +03:00
}
EXPORT_SYMBOL ( blk_start_plug ) ;
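The batching behaviour documented for blk_start_plug() and blk_finish_plug() can be modelled in a few lines of userspace C. This is a hedged sketch, not the kernel implementation: struct toy_plug, the toy_* functions and the threshold TOY_MAX_REQUEST_COUNT are hypothetical, the early flush triggers as soon as the batch reaches the assumed threshold, and flushing on schedule is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of plugging; not the kernel implementation. */
#define TOY_MAX_REQUEST_COUNT 4    /* assumed stand-in for BLK_MAX_REQUEST_COUNT */

struct toy_plug {
	int pending;               /* requests held back in the plug */
};

static struct toy_plug *current_plug;  /* models current->plug */
static int dispatched;                 /* requests sent to the "device" */
static int flushes;                    /* number of batch submissions */

static void toy_flush(struct toy_plug *plug)
{
	if (plug->pending) {
		dispatched += plug->pending;
		plug->pending = 0;
		flushes++;
	}
}

static void toy_start_plug(struct toy_plug *plug)
{
	if (current_plug)          /* nested plug: keep using the outer one */
		return;
	plug->pending = 0;
	current_plug = plug;
}

static void toy_finish_plug(struct toy_plug *plug)
{
	if (plug == current_plug) {    /* only the active plug flushes */
		toy_flush(plug);
		current_plug = NULL;
	}
}

static void toy_submit(void)
{
	struct toy_plug *plug = current_plug;

	if (!plug) {               /* no plug: dispatch immediately */
		dispatched++;
		flushes++;
		return;
	}
	assert(plug->pending < TOY_MAX_REQUEST_COUNT);
	if (++plug->pending >= TOY_MAX_REQUEST_COUNT)
		toy_flush(plug);   /* early flush once the batch is full */
}

/* Six submissions under a plug, with an ignored nested plug. */
static void toy_demo(void)
{
	struct toy_plug outer, inner;
	int i;

	dispatched = 0;
	flushes = 0;
	toy_start_plug(&outer);
	toy_start_plug(&inner);        /* ignored: outer is already active */
	for (i = 0; i < 6; i++)
		toy_submit();
	toy_finish_plug(&inner);       /* no-op: inner was never installed */
	toy_finish_plug(&outer);       /* flushes the remaining requests */
}
```

In this run the six requests reach the device in two batches: four at the early flush and two when the outer plug is finished.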
2012-07-31 11:08:15 +04:00
static void flush_plug_callbacks ( struct blk_plug * plug , bool from_schedule )
2011-04-18 11:52:22 +04:00
{
LIST_HEAD ( callbacks ) ;
2012-07-31 11:08:15 +04:00
while ( ! list_empty ( & plug - > cb_list ) ) {
list_splice_init ( & plug - > cb_list , & callbacks ) ;
2011-04-18 11:52:22 +04:00
2012-07-31 11:08:15 +04:00
while ( ! list_empty ( & callbacks ) ) {
struct blk_plug_cb * cb = list_first_entry ( & callbacks ,
2011-04-18 11:52:22 +04:00
struct blk_plug_cb ,
list ) ;
2012-07-31 11:08:15 +04:00
list_del ( & cb - > list ) ;
2012-07-31 11:08:15 +04:00
cb - > callback ( cb , from_schedule ) ;
2012-07-31 11:08:15 +04:00
}
2011-04-18 11:52:22 +04:00
}
}
2012-07-31 11:08:14 +04:00
struct blk_plug_cb * blk_check_plugged ( blk_plug_cb_fn unplug , void * data ,
int size )
{
struct blk_plug * plug = current - > plug ;
struct blk_plug_cb * cb ;
if ( ! plug )
return NULL ;
list_for_each_entry ( cb , & plug - > cb_list , list )
if ( cb - > callback = = unplug & & cb - > data = = data )
return cb ;
/* Not currently on the callback list */
BUG_ON ( size < sizeof ( * cb ) ) ;
cb = kzalloc ( size , GFP_ATOMIC ) ;
if ( cb ) {
cb - > data = data ;
cb - > callback = unplug ;
list_add ( & cb - > list , & plug - > cb_list ) ;
}
return cb ;
}
EXPORT_SYMBOL ( blk_check_plugged ) ;
2022-01-27 10:05:49 +03:00
void __blk_flush_plug ( struct blk_plug * plug , bool from_schedule )
2011-03-08 15:19:51 +03:00
{
2021-10-20 17:41:18 +03:00
if ( ! list_empty ( & plug - > cb_list ) )
flush_plug_callbacks ( plug , from_schedule ) ;
2021-10-18 19:12:12 +03:00
if ( ! rq_list_empty ( plug - > mq_list ) )
2013-10-24 12:20:05 +04:00
blk_mq_flush_plug_list ( plug , from_schedule ) ;
2021-11-03 14:49:07 +03:00
/*
* Unconditionally flush out cached requests , even if the unplug
* event came from schedule . Since we now hold references to the
* queue for cached requests , we don ' t want a blocked task holding
* up a queue freeze / quiesce event .
*/
if ( unlikely ( ! rq_list_empty ( plug - > cached_rq ) ) )
2021-10-06 15:34:11 +03:00
blk_mq_free_plug_rqs ( plug ) ;
2011-03-08 15:19:51 +03:00
}
2019-01-09 00:57:34 +03:00
/**
* blk_finish_plug - mark the end of a batch of submitted I / O
* @ plug : The & struct blk_plug passed to blk_start_plug ( )
*
* Description :
* Indicate that a batch of I / O submissions is complete . This function
* must be paired with an initial call to blk_start_plug ( ) . The intent
* is to allow the block layer to optimize I / O submission . See the
* documentation for blk_start_plug ( ) for more information .
*/
2011-03-08 15:19:51 +03:00
void blk_finish_plug ( struct blk_plug * plug )
{
2021-10-20 17:41:19 +03:00
if ( plug = = current - > plug ) {
2022-01-27 10:05:49 +03:00
__blk_flush_plug ( plug , false ) ;
2021-10-20 17:41:19 +03:00
current - > plug = NULL ;
}
2011-03-08 15:19:51 +03:00
}
2011-04-15 17:20:10 +04:00
EXPORT_SYMBOL ( blk_finish_plug ) ;
2011-03-08 15:19:51 +03:00
2020-05-14 11:45:09 +03:00
void blk_io_schedule ( void )
{
/* Prevent hang_check timer from firing at us during very long I/O */
unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2 ;
if ( timeout )
io_schedule_timeout ( timeout ) ;
else
io_schedule ( ) ;
}
EXPORT_SYMBOL_GPL ( blk_io_schedule ) ;
2005-04-17 02:20:36 +04:00
int __init blk_dev_init ( void )
{
2022-07-14 21:06:32 +03:00
BUILD_BUG_ON ( ( __force u32 ) REQ_OP_LAST > = ( 1 < < REQ_OP_BITS ) ) ;
2016-10-28 17:48:16 +03:00
BUILD_BUG_ON ( REQ_OP_BITS + REQ_FLAG_BITS > 8 *
2019-12-09 21:31:43 +03:00
sizeof_field ( struct request , cmd_flags ) ) ;
2016-10-28 17:48:16 +03:00
BUILD_BUG_ON ( REQ_OP_BITS + REQ_FLAG_BITS > 8 *
2019-12-09 21:31:43 +03:00
sizeof_field ( struct bio , bi_opf ) ) ;
2009-04-27 16:53:54 +04:00
2011-01-03 17:01:47 +03:00
/* used for unplugging and affects IO latency/throughput - HIGHPRI */
kblockd_workqueue = alloc_workqueue ( " kblockd " ,
2014-06-12 01:43:54 +04:00
WQ_MEM_RECLAIM | WQ_HIGHPRI , 0 ) ;
2005-04-17 02:20:36 +04:00
if ( ! kblockd_workqueue )
panic ( " Failed to create kblockd \n " ) ;
2015-11-21 00:16:46 +03:00
blk_requestq_cachep = kmem_cache_create ( " request_queue " ,
2007-07-24 11:28:11 +04:00
sizeof ( struct request_queue ) , 0 , SLAB_PANIC , NULL ) ;
2005-04-17 02:20:36 +04:00
2017-02-01 01:53:20 +03:00
blk_debugfs_root = debugfs_create_dir ( " block " , NULL ) ;
2008-01-24 10:53:35 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}