2005-04-17 02:20:36 +04:00
/*
 * Copyright (C) 1991, 1992 Linus Torvalds
 * Copyright (C) 1994, Karl Keyte: Added support for disk statistics
 * Elevator latency, (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
 * Queue request tables / lock, selectable elevator, Jens Axboe <axboe@suse.de>
2008-01-31 15:03:55 +03:00
 * kernel-doc documentation started by NeilBrown <neilb@cse.unsw.edu.au>
 * - July2000
2005-04-17 02:20:36 +04:00
 * bio rewrite, highmem i/o, etc, Jens Axboe <axboe@suse.de> - may 2001
*/
/*
 * This handles all read/write requests to block devices
*/
# include <linux/kernel.h>
# include <linux/module.h>
# include <linux/backing-dev.h>
# include <linux/bio.h>
# include <linux/blkdev.h>
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
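As a rough illustration of the 1:1 versus N:M cases described above, the sketch below maps a submitting CPU to a hardware queue index. It is a user-space toy, not the mapping blk-mq itself uses: the function name `map_cpu_to_hw_queue`, its parameters, and the modulo policy are all assumptions made for this example.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel API): pick a hardware submission
 * queue for the per-cpu software queue belonging to `cpu`.
 */
unsigned int map_cpu_to_hw_queue(unsigned int cpu,
				 unsigned int nr_cpus,
				 unsigned int nr_hw_queues)
{
	assert(nr_cpus > 0 && nr_hw_queues > 0 && cpu < nr_cpus);

	/* 1:1 case: the hardware exposes at least one queue per CPU. */
	if (nr_hw_queues >= nr_cpus)
		return cpu;

	/* N:M case: several CPUs funnel into the same hardware queue. */
	return cpu % nr_hw_queues;
}
```

With 8 CPUs and 2 hardware queues, CPU 5 lands on queue 1 in this toy policy; with 4 CPUs and 4 queues every CPU keeps its own queue.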
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
# include <linux/blk-mq.h>
2005-04-17 02:20:36 +04:00
# include <linux/highmem.h>
# include <linux/mm.h>
# include <linux/kernel_stat.h>
# include <linux/string.h>
# include <linux/init.h>
# include <linux/completion.h>
# include <linux/slab.h>
# include <linux/swap.h>
# include <linux/writeback.h>
2006-12-10 13:19:35 +03:00
# include <linux/task_io_accounting_ops.h>
2006-12-08 13:39:46 +03:00
# include <linux/fault-inject.h>
2011-03-08 15:19:51 +03:00
# include <linux/list_sort.h>
2011-10-19 16:32:38 +04:00
# include <linux/delay.h>
2012-04-20 03:29:22 +04:00
# include <linux/ratelimit.h>
2013-03-23 07:42:26 +04:00
# include <linux/pm_runtime.h>
tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print.
blktrace, by contrast, does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 09:43:05 +04:00
# define CREATE_TRACE_POINTS
# include <trace/events/block.h>
2005-04-17 02:20:36 +04:00
2008-01-29 16:51:59 +03:00
# include "blk.h"
2012-03-06 01:15:12 +04:00
# include "blk-cgroup.h"
2013-12-26 17:31:35 +04:00
# include "blk-mq.h"
2008-01-29 16:51:59 +03:00
2010-11-16 14:52:38 +03:00
EXPORT_TRACEPOINT_SYMBOL_GPL ( block_bio_remap ) ;
2009-10-01 23:16:13 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL ( block_rq_remap ) ;
2013-04-18 20:00:26 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL ( block_bio_complete ) ;
2014-04-28 22:30:52 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL ( block_split ) ;
2012-12-14 23:49:27 +04:00
EXPORT_TRACEPOINT_SYMBOL_GPL ( block_unplug ) ;
2008-11-26 13:59:56 +03:00
2011-12-14 03:33:37 +04:00
DEFINE_IDA ( blk_queue_ida ) ;
2005-04-17 02:20:36 +04:00
/*
* For the allocated request tables
*/
2013-10-24 12:20:05 +04:00
struct kmem_cache * request_cachep = NULL ;
2005-04-17 02:20:36 +04:00
/*
* For queue allocation
*/
2008-01-31 15:03:55 +03:00
struct kmem_cache * blk_requestq_cachep ;
2005-04-17 02:20:36 +04:00
/*
* Controlling structure to kblockd
*/
2006-01-09 18:02:34 +03:00
static struct workqueue_struct * kblockd_workqueue ;
2005-04-17 02:20:36 +04:00
2008-01-29 16:51:59 +03:00
void blk_queue_congestion_threshold ( struct request_queue * q )
2005-04-17 02:20:36 +04:00
{
	int nr;

	nr = q->nr_requests - (q->nr_requests / 8) + 1;
	if (nr > q->nr_requests)
		nr = q->nr_requests;
	q->nr_congestion_on = nr;

	nr = q->nr_requests - (q->nr_requests / 8) - (q->nr_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	q->nr_congestion_off = nr;
}
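To make the threshold arithmetic above concrete, here is a minimal user-space sketch that reproduces the same integer calculation for a given queue depth. The struct `congestion_thresholds` and the function `compute_congestion_thresholds` are hypothetical names introduced for this example; only the arithmetic is taken from blk_queue_congestion_threshold().

```c
#include <assert.h>

/* Hypothetical stand-in for the two request_queue fields set above. */
struct congestion_thresholds {
	int on;		/* value stored in nr_congestion_on */
	int off;	/* value stored in nr_congestion_off */
};

/* Same integer arithmetic as blk_queue_congestion_threshold(). */
struct congestion_thresholds compute_congestion_thresholds(int nr_requests)
{
	struct congestion_thresholds t;
	int nr;

	assert(nr_requests > 0);

	/* "on" threshold: 7/8 of the depth plus one, capped at the depth. */
	nr = nr_requests - (nr_requests / 8) + 1;
	if (nr > nr_requests)
		nr = nr_requests;
	t.on = nr;

	/* "off" threshold: a further 1/16 lower, never below one. */
	nr = nr_requests - (nr_requests / 8) - (nr_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}
```

For a depth of 128 this gives on = 113 and off = 103. The gap between the two values suggests hysteresis, so the state does not flip on every request. For very small depths the clamps take over: a depth of 4 yields on = 4, off = 3.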
/**
 * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
 * @bdev:	device
 *
 * Locates the passed device's request queue and returns the address of its
2014-09-08 03:03:56 +04:00
 * backing_dev_info. This function can only be called if @bdev is opened
 * and the return value is never NULL.
2005-04-17 02:20:36 +04:00
*/
struct backing_dev_info * blk_get_backing_dev_info ( struct block_device * bdev )
{
2007-07-24 11:28:11 +04:00
struct request_queue * q = bdev_get_queue ( bdev ) ;
2005-04-17 02:20:36 +04:00
2014-09-08 03:03:56 +04:00
	return &q->backing_dev_info;
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL ( blk_get_backing_dev_info ) ;
2008-04-29 11:54:36 +04:00
void blk_rq_init ( struct request_queue * q , struct request * rq )
2005-04-17 02:20:36 +04:00
{
2008-04-25 14:26:28 +04:00
memset ( rq , 0 , sizeof ( * rq ) ) ;
2005-04-17 02:20:36 +04:00
	INIT_LIST_HEAD(&rq->queuelist);
2008-09-14 16:55:09 +04:00
	INIT_LIST_HEAD(&rq->timeout_list);
2008-09-13 22:26:01 +04:00
	rq->cpu = -1;
2008-02-08 14:41:03 +03:00
	rq->q = q;
2009-05-07 17:24:44 +04:00
	rq->__sector = (sector_t) -1;
2006-07-13 13:55:04 +04:00
	INIT_HLIST_NODE(&rq->hash);
	RB_CLEAR_NODE(&rq->rb_node);
2008-04-29 11:54:39 +04:00
	rq->cmd = rq->__cmd;
2009-04-02 09:43:26 +04:00
	rq->cmd_len = BLK_MAX_CDB;
2008-02-08 14:41:03 +03:00
	rq->tag = -1;
2009-04-23 06:05:18 +04:00
	rq->start_time = jiffies;
2010-04-02 02:01:41 +04:00
set_start_time_ns ( rq ) ;
2011-01-05 18:57:38 +03:00
	rq->part = NULL;
2005-04-17 02:20:36 +04:00
}
2008-04-29 11:54:36 +04:00
EXPORT_SYMBOL ( blk_rq_init ) ;
2005-04-17 02:20:36 +04:00
2007-09-27 14:46:13 +04:00
static void req_bio_endio ( struct request * rq , struct bio * bio ,
unsigned int nbytes , int error )
2005-04-17 02:20:36 +04:00
{
2011-01-25 14:43:52 +03:00
	if (error)
		clear_bit(BIO_UPTODATE, &bio->bi_flags);
	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
		error = -EIO;
2006-01-06 11:51:03 +03:00
2011-01-25 14:43:52 +03:00
	if (unlikely(rq->cmd_flags & REQ_QUIET))
		set_bit(BIO_QUIET, &bio->bi_flags);
block: Supress Buffer I/O errors when SCSI REQ_QUIET flag set
Allow the scsi request REQ_QUIET flag to be propagated to the buffer
file system layer. The basic idea is to pass the flag from the scsi
request to the bio (block IO) and then to the buffer layer. The buffer
layer can then suppress needless printks.
This patch declutters the kernel log by removing the 40-50 (per lun)
buffer io error messages seen during a boot in my multipath setup. There
is a good chance any real errors will be missed in the "noise" in the
logs without this patch.
During boot I see blocks of messages like
"
__ratelimit: 211 callbacks suppressed
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242847
Buffer I/O error on device sdm, logical block 1
Buffer I/O error on device sdm, logical block 5242878
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242872
"
in my logs.
My disk environment is multipath fiber channel using the SCSI_DH_RDAC
code and multipathd. This topology includes an "active" and "ghost"
path for each lun. IO's to the "ghost" path will never complete and the
SCSI layer, via the scsi device handler rdac code, quickly returns the IOs
to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
layer messages.
I am wanting to extend the QUIET behavior to include the buffer file
system layer to deal with these errors as well. I have been running this
patch for a while now on several boxes without issue. A few runs of
bonnie++ show no noticeable difference in performance in my setup.
Thanks to John Stultz for the quiet_error finalization.
Submitted-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-25 12:24:35 +03:00
2012-09-21 03:38:30 +04:00
bio_advance ( bio , nbytes ) ;
2008-06-30 22:04:41 +04:00
2011-01-25 14:43:52 +03:00
/* don't actually finish bio if it's part of flush sequence */
2013-10-12 02:44:27 +04:00
	if (bio->bi_iter.bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
2011-01-25 14:43:52 +03:00
		bio_endio(bio, error);
2005-04-17 02:20:36 +04:00
}
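The completion logic of req_bio_endio() above can be followed more easily in a small user-space model. The sketch below is an assumption-laden simplification: `struct model_bio`, its fields, and `model_req_bio_endio` are hypothetical names, the bit flags are replaced by plain booleans, and bio_advance() is reduced to a byte counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for struct bio. */
struct model_bio {
	unsigned int remaining;	/* bytes still outstanding (bi_iter.bi_size) */
	bool uptodate;		/* models the BIO_UPTODATE flag */
	bool ended;		/* set once the bio would be passed to bio_endio() */
	int error;		/* error value handed to bio_endio() */
};

/*
 * Model of req_bio_endio(): account nbytes of completed I/O and end the
 * bio only when nothing remains and it is not part of a flush sequence.
 */
void model_req_bio_endio(struct model_bio *bio, unsigned int nbytes,
			 int error, bool flush_seq)
{
	assert(nbytes <= bio->remaining);

	if (error)
		bio->uptodate = false;
	else if (!bio->uptodate)
		error = -EIO;	/* an earlier partial completion failed */

	bio->remaining -= nbytes;	/* stands in for bio_advance() */

	if (bio->remaining == 0 && !flush_seq) {
		bio->ended = true;
		bio->error = error;
	}
}
```

Completing a 4096-byte bio in two steps (1024, then 3072 bytes) ends it only on the second call. If the flush-sequence flag is set, the bio is left unfinished even when no bytes remain.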
void blk_dump_rq_flags ( struct request * rq , char * msg )
{
int bit ;
2013-05-23 14:25:08 +04:00
printk ( KERN_INFO " %s: dev %s: type=%x, flags=%llx \n " , msg ,
2006-08-10 10:44:47 +04:00
		rq->rq_disk ? rq->rq_disk->disk_name : " ? ", rq->cmd_type,
2013-05-23 14:25:08 +04:00
		(unsigned long long) rq->cmd_flags);
2005-04-17 02:20:36 +04:00
2009-05-07 17:24:39 +04:00
printk ( KERN_INFO " sector %llu, nr/cnr %u/%u \n " ,
( unsigned long long ) blk_rq_pos ( rq ) ,
blk_rq_sectors ( rq ) , blk_rq_cur_sectors ( rq ) ) ;
2014-04-10 19:46:28 +04:00
printk ( KERN_INFO " bio %p, biotail %p, len %u \n " ,
	       rq->bio, rq->biotail, blk_rq_bytes(rq));
2005-04-17 02:20:36 +04:00
2010-08-07 20:17:56 +04:00
	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
2008-01-31 15:03:55 +03:00
printk ( KERN_INFO " cdb: " ) ;
2008-04-29 16:37:52 +04:00
		for (bit = 0; bit < BLK_MAX_CDB; bit++)
2005-04-17 02:20:36 +04:00
			printk(" %02x ", rq->cmd[bit]);
		printk(" \n ");
	}
}
EXPORT_SYMBOL ( blk_dump_rq_flags ) ;
2011-03-02 19:08:00 +03:00
static void blk_delay_work ( struct work_struct * work )
2005-04-17 02:20:36 +04:00
{
2011-03-02 19:08:00 +03:00
struct request_queue * q ;
2005-04-17 02:20:36 +04:00
2011-03-02 19:08:00 +03:00
	q = container_of(work, struct request_queue, delay_work.work);
	spin_lock_irq(q->queue_lock);
2011-04-18 13:41:33 +04:00
__blk_run_queue ( q ) ;
2011-03-02 19:08:00 +03:00
	spin_unlock_irq(q->queue_lock);
2005-04-17 02:20:36 +04:00
}
/**
2011-03-02 19:08:00 +03:00
* blk_delay_queue - restart queueing after defined interval
* @ q : The & struct request_queue in question
* @ msecs : Delay in msecs
2005-04-17 02:20:36 +04:00
*
* Description :
2011-03-02 19:08:00 +03:00
* Sometimes queueing needs to be postponed for a little while , to allow
* resources to come back . This function will make sure that queueing is
2012-11-28 16:45:56 +04:00
* restarted around the specified time . Queue lock must be held .
2011-03-02 19:08:00 +03:00
*/
void blk_delay_queue ( struct request_queue * q , unsigned long msecs )
2007-11-07 22:26:56 +03:00
{
2012-11-28 16:45:56 +04:00
	if (likely(!blk_queue_dead(q)))
		queue_delayed_work(kblockd_workqueue, &q->delay_work,
				   msecs_to_jiffies(msecs));
2007-11-07 22:26:56 +03:00
}
2011-03-02 19:08:00 +03:00
EXPORT_SYMBOL ( blk_delay_queue ) ;
2007-11-07 22:26:56 +03:00
2005-04-17 02:20:36 +04:00
/**
* blk_start_queue - restart a previously stopped queue
2007-07-24 11:28:11 +04:00
* @ q : The & struct request_queue in question
2005-04-17 02:20:36 +04:00
*
* Description :
* blk_start_queue ( ) will clear the stop flag on the queue , and call
* the request_fn for the queue if it was in a stopped state when
* entered . Also see blk_stop_queue ( ) . Queue lock must be held .
* */
2007-07-24 11:28:11 +04:00
void blk_start_queue ( struct request_queue * q )
2005-04-17 02:20:36 +04:00
{
2006-06-05 14:09:01 +04:00
WARN_ON ( ! irqs_disabled ( ) ) ;
2008-04-29 16:48:33 +04:00
queue_flag_clear ( QUEUE_FLAG_STOPPED , q ) ;
2011-04-18 13:41:33 +04:00
__blk_run_queue ( q ) ;
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL ( blk_start_queue ) ;
/**
* blk_stop_queue - stop a queue
2007-07-24 11:28:11 +04:00
* @ q : The & struct request_queue in question
2005-04-17 02:20:36 +04:00
*
* Description :
 *   The Linux block layer assumes that a block driver will consume all
 *   entries on the request queue when the request_fn strategy is called.
 *   Often this will not happen, because of hardware limitations (queue
 *   depth settings). If a device driver gets a 'queue full' response,
 *   or if it simply chooses not to queue more I/O at one point, it can
 *   call this function to prevent the request_fn from being called until
 *   the driver has signalled it's ready to go again. This happens by calling
 *   blk_start_queue() to restart queue operations. Queue lock must be held.
* */
2007-07-24 11:28:11 +04:00
void blk_stop_queue ( struct request_queue * q )
2005-04-17 02:20:36 +04:00
{
2012-08-22 00:18:24 +04:00
cancel_delayed_work ( & q - > delay_work ) ;
2008-04-29 16:48:33 +04:00
queue_flag_set ( QUEUE_FLAG_STOPPED , q ) ;
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL ( blk_stop_queue ) ;
/**
* blk_sync_queue - cancel any pending callbacks on a queue
* @ q : the queue
*
* Description :
* The block layer may perform asynchronous callback activity
* on a queue , such as calling the unplug function after a timeout .
* A block device may call blk_sync_queue to ensure that any
* such activity is cancelled , thus allowing it to release resources
2007-05-09 10:57:56 +04:00
* that the callbacks might use . The caller must already have made sure
2005-04-17 02:20:36 +04:00
 *   that its ->make_request_fn will not re-add plugging prior to calling
* this function .
*
2011-03-03 03:05:33 +03:00
* This function does not cancel any asynchronous activity arising
 *   out of elevator or throttling code. That would require elevator_exit()
2012-03-06 01:15:12 +04:00
* and blkcg_exit_queue ( ) to be called with queue lock initialized .
2011-03-03 03:05:33 +03:00
*
2005-04-17 02:20:36 +04:00
*/
void blk_sync_queue ( struct request_queue * q )
{
2008-11-19 16:38:39 +03:00
	del_timer_sync(&q->timeout);
2013-12-26 17:31:36 +04:00
	if (q->mq_ops) {
		struct blk_mq_hw_ctx *hctx;
		int i;
2014-04-16 20:48:08 +04:00
		queue_for_each_hw_ctx(q, hctx, i) {
			cancel_delayed_work_sync(&hctx->run_work);
			cancel_delayed_work_sync(&hctx->delay_work);
		}
2013-12-26 17:31:36 +04:00
	} else {
		cancel_delayed_work_sync(&q->delay_work);
	}
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL ( blk_sync_queue ) ;
2012-12-06 17:32:01 +04:00
/**
* __blk_run_queue_uncond - run a queue whether or not it has been stopped
* @ q : The queue to run
*
* Description :
* Invoke request handling on a queue if there are any pending requests .
* May be used to restart request handling after a request has completed .
* This variant runs the queue whether or not the queue has been
* stopped . Must be called with the queue lock held and interrupts
* disabled . See also @ blk_run_queue .
*/
inline void __blk_run_queue_uncond ( struct request_queue * q )
{
if ( unlikely ( blk_queue_dead ( q ) ) )
return ;
2012-11-28 16:46:45 +04:00
/*
	 * Some request_fn implementations, e.g. scsi_request_fn(), unlock
	 * the queue lock internally. As a result multiple threads may be
	 * running such a request function concurrently. Keep track of the
	 * number of active request_fn invocations such that blk_drain_queue()
	 * can wait until all these request_fn calls have finished.
*/
	q->request_fn_active++;
2012-12-06 17:32:01 +04:00
	q->request_fn(q);
2012-11-28 16:46:45 +04:00
	q->request_fn_active--;
2012-12-06 17:32:01 +04:00
}
2005-04-17 02:20:36 +04:00
/**
2008-10-14 11:51:06 +04:00
* __blk_run_queue - run a single device queue
2005-04-17 02:20:36 +04:00
* @ q : The queue to run
2008-10-14 11:51:06 +04:00
*
* Description :
* See @ blk_run_queue . This variant must be called with the queue lock
2011-04-18 13:41:33 +04:00
* held and interrupts disabled .
2005-04-17 02:20:36 +04:00
*/
2011-04-18 13:41:33 +04:00
void __blk_run_queue ( struct request_queue * q )
2005-04-17 02:20:36 +04:00
{
2009-04-23 06:05:17 +04:00
if ( unlikely ( blk_queue_stopped ( q ) ) )
return ;
2012-12-06 17:32:01 +04:00
__blk_run_queue_uncond ( q ) ;
2008-04-29 16:48:33 +04:00
}
EXPORT_SYMBOL ( __blk_run_queue ) ;
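The difference between __blk_run_queue() and __blk_run_queue_uncond() comes down to which queue state each one checks before invoking the request function. The following user-space sketch models just that: `struct model_queue` and both `model_*` functions are hypothetical, locking is omitted, and the request function is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of struct request_queue to the relevant state. */
struct model_queue {
	bool stopped;		/* models QUEUE_FLAG_STOPPED */
	bool dead;		/* models blk_queue_dead() */
	int request_fn_active;	/* in-flight request_fn invocations */
	int runs;		/* how often the request function ran */
};

/* Model of __blk_run_queue_uncond(): ignores "stopped", honours "dead". */
void model_run_queue_uncond(struct model_queue *q)
{
	assert(q != NULL);

	if (q->dead)
		return;

	q->request_fn_active++;	/* lets a drain loop wait for us */
	q->runs++;		/* stands in for q->request_fn(q) */
	q->request_fn_active--;
}

/* Model of __blk_run_queue(): additionally refuses a stopped queue. */
void model_run_queue(struct model_queue *q)
{
	assert(q != NULL);

	if (q->stopped)
		return;

	model_run_queue_uncond(q);
}
```

On a stopped queue only the unconditional variant runs the request function; on a dead queue neither does. The counter is back to zero after each call here because the model is single-threaded, whereas in the kernel it can be observed as non-zero while a request_fn has dropped the queue lock.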
2006-05-11 10:20:16 +04:00
2011-04-18 13:41:33 +04:00
/**
* blk_run_queue_async - run a single device queue in workqueue context
* @ q : The queue to run
*
* Description :
* Tells kblockd to perform the equivalent of @ blk_run_queue on behalf
2012-11-28 16:45:56 +04:00
* of us . The caller must hold the queue lock .
2011-04-18 13:41:33 +04:00
*/
void blk_run_queue_async ( struct request_queue * q )
{
2012-11-28 16:45:56 +04:00
	if (likely(!blk_queue_stopped(q) && !blk_queue_dead(q)))
2012-08-22 00:18:24 +04:00
		mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
2011-04-18 13:41:33 +04:00
}
2011-04-19 15:32:46 +04:00
EXPORT_SYMBOL ( blk_run_queue_async ) ;
2011-04-18 13:41:33 +04:00
2008-04-29 16:48:33 +04:00
/**
* blk_run_queue - run a single device queue
* @ q : The queue to run
2008-10-14 11:51:06 +04:00
*
* Description :
* Invoke request handling on this queue , if it has pending work to do .
2009-04-23 06:05:17 +04:00
* May be used to restart queueing when a request has completed .
2008-04-29 16:48:33 +04:00
*/
void blk_run_queue ( struct request_queue * q )
{
unsigned long flags ;
	spin_lock_irqsave(q->queue_lock, flags);
2011-04-18 13:41:33 +04:00
__blk_run_queue ( q ) ;
2005-04-17 02:20:36 +04:00
	spin_unlock_irqrestore(q->queue_lock, flags);
}
EXPORT_SYMBOL ( blk_run_queue ) ;
2007-07-24 11:28:11 +04:00
void blk_put_queue ( struct request_queue * q )
2006-03-19 02:34:37 +03:00
{
	kobject_put(&q->kobj);
}
2011-05-27 09:44:43 +04:00
EXPORT_SYMBOL ( blk_put_queue ) ;
2006-03-19 02:34:37 +03:00
2011-10-19 16:32:38 +04:00
/**
2012-11-28 16:43:38 +04:00
* __blk_drain_queue - drain requests from request_queue
2011-10-19 16:32:38 +04:00
* @ q : queue to drain
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 16:42:16 +04:00
 * @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
2011-10-19 16:32:38 +04:00
*
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
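The shutdown/release split described above can be sketched in plain userspace C. This is only a toy model under loud assumptions: `struct toy_queue` and every `toy_queue_*` function are hypothetical names invented for illustration, the code is single-threaded and takes no locks, and "draining" simply completes pending requests instead of waiting for real I/O. The only detail borrowed from the source is that requests issued after shutdown fail with -ENODEV.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy object; NOT the kernel's struct request_queue. */
struct toy_queue {
	int refcnt;	/* references held by the owner and other users */
	int pending;	/* requests accepted but not yet completed */
	bool dead;	/* set by shutdown; new requests fail afterwards */
	bool released;	/* set once the last reference is dropped */
};

static void toy_queue_init(struct toy_queue *q)
{
	q->refcnt = 1;		/* the owner's reference */
	q->pending = 0;
	q->dead = false;
	q->released = false;
}

static void toy_queue_get(struct toy_queue *q)
{
	q->refcnt++;
}

/* Release: the remaining teardown runs only when the last reference goes. */
static void toy_queue_put(struct toy_queue *q)
{
	assert(q->refcnt > 0);
	if (--q->refcnt == 0)
		q->released = true;
}

/* A reference holder may call this at any time, even after shutdown. */
static int toy_queue_submit(struct toy_queue *q)
{
	if (q->dead)
		return -ENODEV;
	q->pending++;
	return 0;
}

static void toy_queue_complete(struct toy_queue *q)
{
	assert(q->pending > 0);
	q->pending--;
}

/* Shutdown: drain pending requests, mark dead, drop the owner's reference. */
static void toy_queue_shutdown(struct toy_queue *q)
{
	while (q->pending > 0)
		toy_queue_complete(q);	/* stands in for waiting on real I/O */
	q->dead = true;
	toy_queue_put(q);
}
```

In this model a second user that took its own reference can keep calling `toy_queue_submit()` after `toy_queue_shutdown()`; it gets -ENODEV back, and the object is only marked released once that user also calls `toy_queue_put()`.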
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 16:42:16 +04:00
 * Drain requests from @q. If @drain_all is set, all requests are drained.
 * If not, only ELVPRIV requests are drained. The caller is responsible
 * for ensuring that no new requests which need to be drained are queued.
2011-10-19 16:32:38 +04:00
 */
2012-11-28 16:43:38 +04:00
static void __blk_drain_queue(struct request_queue *q, bool drain_all)
	__releases(q->queue_lock)
	__acquires(q->queue_lock)
2011-10-19 16:32:38 +04:00
{
2012-06-15 10:45:25 +04:00
	int i;
2012-11-28 16:43:38 +04:00
	lockdep_assert_held(q->queue_lock);
2011-10-19 16:32:38 +04:00
	while (true) {
2011-12-14 03:33:37 +04:00
		bool drain = false;
2011-10-19 16:32:38 +04:00
2012-03-07 00:24:55 +04:00
		/*
		 * The caller might be trying to drain @q before its
		 * elevator is initialized.
		 */
		if (q->elevator)
			elv_drain_elevator(q);
2012-03-06 01:15:12 +04:00
		blkcg_drain_queue(q);
2011-10-19 16:32:38 +04:00
2011-12-15 23:03:04 +04:00
		/*
		 * This function might be called on a queue which failed
2012-03-07 00:24:55 +04:00
		 * driver init after queue creation or is not fully active
		 * yet. Some drivers (e.g. fd and loop) get unhappy
		 * in such cases. Kick queue iff dispatch queue has
		 * something on it and @q has request_fn set.
2011-12-15 23:03:04 +04:00
		 */
2012-03-07 00:24:55 +04:00
		if (!list_empty(&q->queue_head) && q->request_fn)
2011-12-15 23:03:04 +04:00
			__blk_run_queue(q);
2011-10-19 16:42:16 +04:00
2012-06-05 07:40:58 +04:00
		drain |= q->nr_rqs_elvpriv;
2012-11-28 16:46:45 +04:00
		drain |= q->request_fn_active;
2011-12-14 03:33:37 +04:00
		/*
		 * Unfortunately, requests are queued at and tracked from
		 * multiple places and there's no single counter which can
		 * be drained. Check all the queues and counters.
		 */
		if (drain_all) {
2014-09-25 19:23:46 +04:00
			struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
2011-12-14 03:33:37 +04:00
			drain |= !list_empty(&q->queue_head);
			for (i = 0; i < 2; i++) {
2012-06-05 07:40:58 +04:00
				drain |= q->nr_rqs[i];
2011-12-14 03:33:37 +04:00
				drain |= q->in_flight[i];
2014-09-25 19:23:43 +04:00
				if (fq)
					drain |= !list_empty(&fq->flush_queue[i]);
2011-12-14 03:33:37 +04:00
			}
		}
2011-10-19 16:32:38 +04:00
2011-12-14 03:33:37 +04:00
		if (!drain)
2011-10-19 16:32:38 +04:00
			break;
2012-11-28 16:43:38 +04:00
		spin_unlock_irq(q->queue_lock);
2011-10-19 16:32:38 +04:00
		msleep(10);
2012-11-28 16:43:38 +04:00
		spin_lock_irq(q->queue_lock);
2011-10-19 16:32:38 +04:00
	}
2012-06-15 10:45:25 +04:00
	/*
	 * With queue marked dead, any woken up waiter will fail the
	 * allocation path, so the wakeup chaining is lost and we're
	 * left with hung waiters. We need to wake up those waiters.
	 */
	if (q->request_fn) {
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 02:05:44 +04:00
		struct request_list *rl;
		blk_queue_for_each_rl(rl, q)
			for (i = 0; i < ARRAY_SIZE(rl->wait); i++)
				wake_up_all(&rl->wait[i]);
2012-06-15 10:45:25 +04:00
	}
2011-10-19 16:32:38 +04:00
}
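The polling structure of the drain loop above (check several counters, and if anything is still outstanding, let other contexts make progress and check again) can be illustrated with a small userspace sketch. Everything here is hypothetical: `struct toy_counters`, `toy_needs_drain()`, `toy_drain()` and `toy_complete_one()` are made-up names, and the `wait_step` callback stands in for the unlock, msleep(10), relock sequence during which real completions would arrive. It is a single-threaded model, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters; the real queue tracks requests in more places. */
struct toy_counters {
	int elvpriv;		/* requests carrying elevator-private data */
	int queued[2];		/* allocated requests, indexed sync/async */
	int in_flight[2];	/* requests handed to the driver */
};

/* There is no single counter to test, so OR all of them together. */
static bool toy_needs_drain(const struct toy_counters *c, bool drain_all)
{
	bool drain = false;
	int i;

	drain |= c->elvpriv;
	if (drain_all) {
		for (i = 0; i < 2; i++) {
			drain |= c->queued[i];
			drain |= c->in_flight[i];
		}
	}
	return drain;
}

/*
 * Poll until nothing that must be drained is left. Returns the number of
 * polling rounds, purely so the behaviour can be observed in a test.
 */
static int toy_drain(struct toy_counters *c, bool drain_all,
		     void (*wait_step)(struct toy_counters *))
{
	int rounds = 0;

	while (toy_needs_drain(c, drain_all)) {
		wait_step(c);	/* stands in for unlock, sleep, relock */
		rounds++;
	}
	return rounds;
}

/* Example progress function: completes exactly one request per round. */
static void toy_complete_one(struct toy_counters *c)
{
	int i;

	if (c->elvpriv > 0) {
		c->elvpriv--;
		return;
	}
	for (i = 0; i < 2; i++) {
		if (c->in_flight[i] > 0) {
			c->in_flight[i]--;
			return;
		}
		if (c->queued[i] > 0) {
			c->queued[i]--;
			return;
		}
	}
}
```

Calling `toy_drain(&c, false, toy_complete_one)` only waits for the `elvpriv` count to reach zero, while passing `true` keeps polling until every counter is zero, mirroring the two modes selected by @drain_all.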
2012-03-06 01:14:58 +04:00
/**
 * blk_queue_bypass_start - enter queue bypass mode
 * @q: queue of interest
 *
 * In bypass mode, only the dispatch FIFO queue of @q is used. This
 * function makes @q enter bypass mode and drains all requests which were
2012-03-06 01:14:59 +04:00
 * throttled or issued before. On return, it's guaranteed that no request
2012-04-14 01:50:53 +04:00
 * is being throttled or has ELVPRIV set and blk_queue_bypass() is %true
 * inside queue or RCU read lock.
2012-03-06 01:14:58 +04:00
 */
void blk_queue_bypass_start(struct request_queue *q)
{
	spin_lock_irq(q->queue_lock);
2014-07-01 20:29:17 +04:00
	q->bypass_depth++;
2012-03-06 01:14:58 +04:00
	queue_flag_set(QUEUE_FLAG_BYPASS, q);
	spin_unlock_irq(q->queue_lock);
2014-07-01 20:29:17 +04:00
	/*
	 * Queues start drained. Skip actual draining till init is
	 * complete. This avoids lengthy delays during queue init which
	 * can happen many times during boot.
	 */
	if (blk_queue_init_done(q)) {
2012-11-28 16:43:38 +04:00
		spin_lock_irq(q->queue_lock);
		__blk_drain_queue(q, false);
		spin_unlock_irq(q->queue_lock);
2012-04-14 00:11:31 +04:00
		/* ensure blk_queue_bypass() is %true inside RCU read lock */
		synchronize_rcu();
	}
2012-03-06 01:14:58 +04:00
}
EXPORT_SYMBOL_GPL(blk_queue_bypass_start);
/**
 * blk_queue_bypass_end - leave queue bypass mode
 * @q: queue of interest
 *
 * Leave bypass mode and restore the normal queueing behavior.
 */
void blk_queue_bypass_end(struct request_queue *q)
{
	spin_lock_irq(q->queue_lock);
	if (!--q->bypass_depth)
		queue_flag_clear(QUEUE_FLAG_BYPASS, q);
	WARN_ON_ONCE(q->bypass_depth < 0);
	spin_unlock_irq(q->queue_lock);
}
EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
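The start/end pair above nests: the flag is set on every start but cleared only when the depth counter returns to zero. A minimal userspace sketch of that counting scheme follows. It assumes a single thread (the kernel functions do this under queue_lock), and `struct toy_bypass`, `toy_bypass_start()` and `toy_bypass_end()` are hypothetical names used only for illustration. Where the kernel code warns once on a negative depth, this sketch asserts instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the bypass state kept in a request queue. */
struct toy_bypass {
	int depth;	/* number of start calls without a matching end */
	bool active;	/* plays the role of QUEUE_FLAG_BYPASS */
};

static void toy_bypass_start(struct toy_bypass *b)
{
	b->depth++;
	b->active = true;
}

static void toy_bypass_end(struct toy_bypass *b)
{
	/* An end without a matching start is a caller bug. */
	assert(b->depth > 0);
	if (--b->depth == 0)
		b->active = false;
}
```

With this scheme two independent callers can each bracket their work with start and end, and bypass mode stays active until the last of them has finished.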
2011-10-19 16:42:16 +04:00
/**
 * blk_cleanup_queue - shutdown a request queue
 * @q: request queue to shutdown
 *
2012-12-06 17:32:01 +04:00
 * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
 * put it. All future requests will be failed immediately with -ENODEV.
2011-03-03 03:04:42 +03:00
 */
2008-01-31 15:03:55 +03:00
void blk_cleanup_queue(struct request_queue *q)
2006-03-19 02:34:37 +03:00
{
2011-10-19 16:42:16 +04:00
	spinlock_t *lock = q->queue_lock;
2008-09-18 20:22:54 +04:00
2012-11-28 16:42:38 +04:00
	/* mark @q DYING, no new request or merges will be allowed afterwards */
2006-03-19 02:34:37 +03:00
	mutex_lock(&q->sysfs_lock);
2012-11-28 16:42:38 +04:00
	queue_flag_set_unlocked(QUEUE_FLAG_DYING, q);
2011-10-19 16:42:16 +04:00
	spin_lock_irq(lock);
2012-03-06 01:14:59 +04:00
2012-04-14 01:50:53 +04:00
	/*
2012-11-28 16:42:38 +04:00
	 * A dying queue is permanently in bypass mode till released. Note
2012-04-14 01:50:53 +04:00
	 * that, unlike blk_queue_bypass_start(), we aren't performing
	 * synchronize_rcu() after entering bypass mode to avoid the delay
	 * as some drivers create and destroy a lot of queues while
	 * probing. This is still safe because blk_release_queue() will be
	 * called only after the queue refcnt drops to zero and nothing,
	 * RCU or not, would be traversing the queue by then.
	 */
2012-03-06 01:14:59 +04:00
	q->bypass_depth++;
	queue_flag_set(QUEUE_FLAG_BYPASS, q);
2011-10-19 16:42:16 +04:00
	queue_flag_set(QUEUE_FLAG_NOMERGES, q);
	queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
2012-11-28 16:42:38 +04:00
	queue_flag_set(QUEUE_FLAG_DYING, q);
2011-10-19 16:42:16 +04:00
	spin_unlock_irq(lock);
	mutex_unlock(&q->sysfs_lock);
2012-12-06 17:32:01 +04:00
	/*
	 * Drain all requests queued before DYING marking. Set DEAD flag to
	 * prevent that q->request_fn() gets invoked after draining finished.
	 */
2013-12-26 17:31:35 +04:00
	if (q->mq_ops) {
2014-07-01 20:31:13 +04:00
		blk_mq_freeze_queue(q);
2013-12-26 17:31:35 +04:00
		spin_lock_irq(lock);
	} else {
		spin_lock_irq(lock);
		__blk_drain_queue(q, true);
	}
2012-12-06 17:32:01 +04:00
	queue_flag_set(QUEUE_FLAG_DEAD, q);
2012-11-28 16:43:38 +04:00
	spin_unlock_irq(lock);
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and frees td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
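The shutdown/release split described in the commit message above can be sketched in user space as follows. This is only a toy model under stated assumptions: `struct toy_queue` and every `toy_queue_*` function are hypothetical names that do not exist in the kernel, the model is single-threaded (no `queue_lock`, no `exit_mutex`), and "draining" simply completes pending requests in a loop instead of waiting for real I/O.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy queue; none of these names exist in the kernel. */
struct toy_queue {
	int refcount;	/* references held by the owner and other users */
	int pending;	/* requests submitted but not yet completed */
	bool dead;	/* set once by shutdown, never cleared */
};

struct toy_queue *toy_queue_alloc(void)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	if (q)
		q->refcount = 1;	/* the owner's reference */
	return q;
}

void toy_queue_get(struct toy_queue *q)
{
	q->refcount++;
}

/* Release: returns true when this put dropped the last reference and freed q. */
bool toy_queue_put(struct toy_queue *q)
{
	if (--q->refcount > 0)
		return false;
	free(q);
	return true;
}

/* Submitting to a dead queue is still allowed, but fails immediately. */
int toy_queue_submit(struct toy_queue *q)
{
	if (q->dead)
		return -ENODEV;
	q->pending++;
	return 0;
}

void toy_queue_complete(struct toy_queue *q)
{
	if (q->pending > 0)
		q->pending--;
}

/*
 * Shutdown: drain everything that is pending, mark the queue dead, then
 * drop the owner's reference. The memory goes away only on the last put.
 */
bool toy_queue_shutdown(struct toy_queue *q)
{
	while (q->pending > 0)
		toy_queue_complete(q);	/* stands in for waiting on completions */
	q->dead = true;
	return toy_queue_put(q);
}
```

In this sketch a second reference holder can keep calling `toy_queue_submit()` after `toy_queue_shutdown()` and gets `-ENODEV` each time; the structure is freed only when that holder finally calls `toy_queue_put()`.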
2011-10-19 16:42:16 +04:00
/* @q won't process any more request, flush async actions */
del_timer_sync ( & q - > backing_dev_info . laptop_mode_wb_timer ) ;
blk_sync_queue ( q ) ;
2012-05-24 19:28:52 +04:00
spin_lock_irq ( lock ) ;
if ( q - > queue_lock ! = & q - > __queue_lock )
q - > queue_lock = & q - > __queue_lock ;
spin_unlock_irq ( lock ) ;
2011-10-19 16:42:16 +04:00
/* @q is and will stay empty, shutdown and put */
2006-03-19 02:34:37 +03:00
blk_put_queue ( q ) ;
}
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( blk_cleanup_queue ) ;
2012-06-05 07:40:59 +04:00
int blk_init_rl ( struct request_list * rl , struct request_queue * q ,
gfp_t gfp_mask )
2005-04-17 02:20:36 +04:00
{
2010-05-25 21:15:15 +04:00
if ( unlikely ( rl - > rq_pool ) )
return 0 ;
2012-06-05 07:40:59 +04:00
rl - > q = q ;
2009-04-06 16:48:01 +04:00
rl - > count [ BLK_RW_SYNC ] = rl - > count [ BLK_RW_ASYNC ] = 0 ;
rl - > starved [ BLK_RW_SYNC ] = rl - > starved [ BLK_RW_ASYNC ] = 0 ;
init_waitqueue_head ( & rl - > wait [ BLK_RW_SYNC ] ) ;
init_waitqueue_head ( & rl - > wait [ BLK_RW_ASYNC ] ) ;
2005-04-17 02:20:36 +04:00
2005-06-23 11:08:19 +04:00
rl - > rq_pool = mempool_create_node ( BLKDEV_MIN_RQ , mempool_alloc_slab ,
2012-06-05 07:40:53 +04:00
mempool_free_slab , request_cachep ,
2012-06-05 07:40:59 +04:00
gfp_mask , q - > node ) ;
2005-04-17 02:20:36 +04:00
if ( ! rl - > rq_pool )
return - ENOMEM ;
return 0 ;
}
2012-06-05 07:40:59 +04:00
void blk_exit_rl ( struct request_list * rl )
{
if ( rl - > rq_pool )
mempool_destroy ( rl - > rq_pool ) ;
}
2007-07-24 11:28:11 +04:00
struct request_queue * blk_alloc_queue ( gfp_t gfp_mask )
2005-04-17 02:20:36 +04:00
{
2012-11-10 13:39:44 +04:00
return blk_alloc_queue_node ( gfp_mask , NUMA_NO_NODE ) ;
2005-06-23 11:08:19 +04:00
}
EXPORT_SYMBOL ( blk_alloc_queue ) ;
2005-04-17 02:20:36 +04:00
2007-07-24 11:28:11 +04:00
struct request_queue * blk_alloc_queue_node ( gfp_t gfp_mask , int node_id )
2005-06-23 11:08:19 +04:00
{
2007-07-24 11:28:11 +04:00
struct request_queue * q ;
2007-10-17 10:25:46 +04:00
int err ;
2005-06-23 11:08:19 +04:00
2008-01-29 16:51:59 +03:00
q = kmem_cache_alloc_node ( blk_requestq_cachep ,
2007-07-17 15:03:29 +04:00
gfp_mask | __GFP_ZERO , node_id ) ;
2005-04-17 02:20:36 +04:00
if ( ! q )
return NULL ;
2012-03-23 12:58:54 +04:00
q - > id = ida_simple_get ( & blk_queue_ida , 0 , 0 , gfp_mask ) ;
2011-12-14 03:33:37 +04:00
if ( q - > id < 0 )
2014-05-27 19:35:14 +04:00
goto fail_q ;
2011-12-14 03:33:37 +04:00
2009-06-12 16:42:56 +04:00
q - > backing_dev_info . ra_pages =
( VM_MAX_READAHEAD * 1024 ) / PAGE_CACHE_SIZE ;
q - > backing_dev_info . state = 0 ;
q - > backing_dev_info . capabilities = BDI_CAP_MAP_COPY ;
2009-06-12 16:45:52 +04:00
q - > backing_dev_info . name = " block " ;
2011-11-23 13:59:13 +04:00
q - > node = node_id ;
2009-06-12 16:42:56 +04:00
2007-10-17 10:25:46 +04:00
err = bdi_init ( & q - > backing_dev_info ) ;
2011-12-14 03:33:37 +04:00
if ( err )
goto fail_id ;
2007-10-17 10:25:46 +04:00
2010-04-06 16:25:14 +04:00
setup_timer ( & q - > backing_dev_info . laptop_mode_wb_timer ,
laptop_mode_timer_fn , ( unsigned long ) q ) ;
2008-09-14 16:55:09 +04:00
setup_timer ( & q - > timeout , blk_rq_timed_out_timer , ( unsigned long ) q ) ;
2012-03-07 00:24:55 +04:00
INIT_LIST_HEAD ( & q - > queue_head ) ;
2008-09-14 16:55:09 +04:00
INIT_LIST_HEAD ( & q - > timeout_list ) ;
2011-12-14 03:33:41 +04:00
INIT_LIST_HEAD ( & q - > icq_list ) ;
2012-03-06 01:15:18 +04:00
# ifdef CONFIG_BLK_CGROUP
2012-03-06 01:15:20 +04:00
INIT_LIST_HEAD ( & q - > blkg_list ) ;
2012-03-06 01:15:18 +04:00
# endif
2011-03-02 19:08:00 +03:00
INIT_DELAYED_WORK ( & q - > delay_work , blk_delay_work ) ;
2006-03-19 02:34:37 +03:00
2008-01-29 16:51:59 +03:00
kobject_init ( & q - > kobj , & blk_queue_ktype ) ;
2005-04-17 02:20:36 +04:00
2006-03-19 02:34:37 +03:00
mutex_init ( & q - > sysfs_lock ) ;
2008-05-15 03:05:54 +04:00
spin_lock_init ( & q - > __queue_lock ) ;
2006-03-19 02:34:37 +03:00
2011-03-03 03:04:42 +03:00
/*
* By default initialize queue_lock to internal lock and driver can
* override it later if need be .
*/
q - > queue_lock = & q - > __queue_lock ;
2012-04-14 00:11:31 +04:00
/*
* A queue starts its life with bypass turned on to avoid
* unnecessary bypass on / off overhead and nasty surprises during
2012-09-21 01:08:52 +04:00
* init . The initial bypass will be finished when the queue is
* registered by blk_register_queue ( ) .
2012-04-14 00:11:31 +04:00
*/
q - > bypass_depth = 1 ;
__set_bit ( QUEUE_FLAG_BYPASS , & q - > queue_flags ) ;
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
init_waitqueue_head ( & q - > mq_freeze_wq ) ;
2012-03-06 01:15:12 +04:00
if ( blkcg_init_queue ( q ) )
2013-10-14 20:11:36 +04:00
goto fail_bdi ;
2012-03-06 01:15:05 +04:00
2005-04-17 02:20:36 +04:00
return q ;
2011-12-14 03:33:37 +04:00
2013-10-14 20:11:36 +04:00
fail_bdi :
bdi_destroy ( & q - > backing_dev_info ) ;
2011-12-14 03:33:37 +04:00
fail_id :
ida_simple_remove ( & blk_queue_ida , q - > id ) ;
fail_q :
kmem_cache_free ( blk_requestq_cachep , q ) ;
return NULL ;
2005-04-17 02:20:36 +04:00
}
2005-06-23 11:08:19 +04:00
EXPORT_SYMBOL ( blk_alloc_queue_node ) ;
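blk_alloc_queue_node() above unwinds partial initialization through the stacked labels fail_bdi, fail_id and fail_q, each undoing exactly one earlier step. A minimal user-space sketch of the same pattern follows. All names here (`toy_obj`, `toy_acquire`, `toy_release`, `toy_live`, the `fail_step` parameter) are hypothetical and exist only for illustration; `fail_step` injects a failure at a chosen step so the unwinding can be exercised.

```c
#include <stdlib.h>

int toy_live;	/* number of resources currently held; illustration only */

/* Acquire one resource, or simulate failure when 'fail' is non-zero. */
static void *toy_acquire(size_t size, int fail)
{
	void *p = fail ? NULL : calloc(1, size);

	if (p)
		toy_live++;
	return p;
}

static void toy_release(void *p)
{
	free(p);
	toy_live--;
}

struct toy_obj {
	void *id;	/* stands in for the ida-allocated queue id */
	void *bdi;	/* stands in for the initialized backing_dev_info */
};

/* fail_step: 0 = succeed, 1..4 = fail at that step. */
struct toy_obj *toy_obj_create(int fail_step)
{
	struct toy_obj *obj;

	obj = toy_acquire(sizeof(*obj), fail_step == 1);
	if (!obj)
		return NULL;

	obj->id = toy_acquire(1, fail_step == 2);
	if (!obj->id)
		goto fail_obj;

	obj->bdi = toy_acquire(1, fail_step == 3);
	if (!obj->bdi)
		goto fail_id;

	if (fail_step == 4)	/* stands in for a late init step failing */
		goto fail_bdi;

	return obj;

	/* Labels fall through, releasing in reverse order of acquisition. */
fail_bdi:
	toy_release(obj->bdi);
fail_id:
	toy_release(obj->id);
fail_obj:
	toy_release(obj);
	return NULL;
}

void toy_obj_destroy(struct toy_obj *obj)
{
	toy_release(obj->bdi);
	toy_release(obj->id);
	toy_release(obj);
}
```

Because each label releases one resource and falls through to the next, a failure at any step leaves `toy_live` at zero, which is the property the label ordering is meant to guarantee.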
2005-04-17 02:20:36 +04:00
/**
* blk_init_queue - prepare a request queue for use with a block device
* @ rfn : The function to be called to process requests that have been
* placed on the queue .
* @ lock : Request queue spin lock
*
* Description :
* If a block device wishes to use the standard request handling procedures ,
* which sorts requests and coalesces adjacent requests , then it must
* call blk_init_queue ( ) . The function @ rfn will be called when there
* are requests on the queue that need to be processed . If the device
* supports plugging , then @ rfn may not be called immediately when requests
* are available on the queue , but may be called at some time later instead .
* Plugged queues are generally unplugged when a buffer belonging to one
* of the requests on the queue is needed , or due to memory pressure .
*
* @ rfn is not required , or even expected , to remove all requests off the
* queue , but only as many as it can handle at a time . If it does leave
* requests on the queue , it is responsible for arranging that the requests
* get dealt with eventually .
*
* The queue spin lock must be held while manipulating the requests on the
2006-06-05 14:09:01 +04:00
* request queue ; this lock will be taken also from interrupt context , so irq
* disabling is needed for it .
2005-04-17 02:20:36 +04:00
*
2008-08-19 22:13:11 +04:00
* Function returns a pointer to the initialized request queue , or % NULL if
2005-04-17 02:20:36 +04:00
* it didn ' t succeed .
*
* Note :
* blk_init_queue ( ) must be paired with a blk_cleanup_queue ( ) call
* when the block device is deactivated ( such as at module unload ) .
* */
2005-06-23 11:08:19 +04:00
2007-07-24 11:28:11 +04:00
struct request_queue * blk_init_queue ( request_fn_proc * rfn , spinlock_t * lock )
2005-04-17 02:20:36 +04:00
{
2012-11-10 13:39:44 +04:00
return blk_init_queue_node ( rfn , lock , NUMA_NO_NODE ) ;
2005-06-23 11:08:19 +04:00
}
EXPORT_SYMBOL ( blk_init_queue ) ;
2007-07-24 11:28:11 +04:00
struct request_queue *
2005-06-23 11:08:19 +04:00
blk_init_queue_node ( request_fn_proc * rfn , spinlock_t * lock , int node_id )
{
2010-06-03 21:34:52 +04:00
struct request_queue * uninit_q , * q ;
2005-04-17 02:20:36 +04:00
2010-06-03 21:34:52 +04:00
uninit_q = blk_alloc_queue_node ( GFP_KERNEL , node_id ) ;
if ( ! uninit_q )
return NULL ;
2011-11-23 13:59:13 +04:00
q = blk_init_allocated_queue ( uninit_q , rfn , lock ) ;
2010-06-03 21:34:52 +04:00
if ( ! q )
2014-03-09 04:20:01 +04:00
blk_cleanup_queue ( uninit_q ) ;
2014-02-10 20:29:00 +04:00
2014-03-09 04:20:01 +04:00
return q ;
2010-05-11 10:57:42 +04:00
}
EXPORT_SYMBOL ( blk_init_queue_node ) ;
struct request_queue *
blk_init_allocated_queue ( struct request_queue * q , request_fn_proc * rfn ,
spinlock_t * lock )
{
2005-04-17 02:20:36 +04:00
if ( ! q )
return NULL ;
2014-09-25 19:23:47 +04:00
q - > fq = blk_alloc_flush_queue ( q , NUMA_NO_NODE , 0 ) ;
2014-09-25 19:23:44 +04:00
if ( ! q - > fq )
2014-03-09 04:20:01 +04:00
return NULL ;
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 02:05:44 +04:00
if ( blk_init_rl ( & q - > root_rl , q , GFP_KERNEL ) )
2014-03-21 01:03:58 +04:00
goto fail ;
2005-04-17 02:20:36 +04:00
q - > request_fn = rfn ;
q - > prep_rq_fn = NULL ;
2010-07-01 14:49:17 +04:00
q - > unprep_rq_fn = NULL ;
2012-09-21 01:09:30 +04:00
q - > queue_flags | = QUEUE_FLAG_DEFAULT ;
2011-03-03 03:04:42 +03:00
/* Override internal queue lock with supplied lock pointer */
if ( lock )
q - > queue_lock = lock ;
2005-04-17 02:20:36 +04:00
2009-03-06 10:48:33 +03:00
/*
* This also sets hw / phys segments , boundary and size
*/
2011-09-12 14:03:37 +04:00
blk_queue_make_request ( q , blk_queue_bio ) ;
2005-04-17 02:20:36 +04:00
2007-02-20 19:01:57 +03:00
q - > sg_reserved_size = INT_MAX ;
2013-10-16 02:42:16 +04:00
/* Protect q->elevator from elevator_change */
mutex_lock ( & q - > sysfs_lock ) ;
2012-04-14 00:11:31 +04:00
/* init elevator */
2013-10-16 02:42:16 +04:00
if ( elevator_init ( q , NULL ) ) {
mutex_unlock ( & q - > sysfs_lock ) ;
2014-03-21 01:03:58 +04:00
goto fail ;
2013-10-16 02:42:16 +04:00
}
mutex_unlock ( & q - > sysfs_lock ) ;
2012-04-14 00:11:31 +04:00
return q ;
2014-03-21 01:03:58 +04:00
fail :
2014-09-25 19:23:44 +04:00
blk_free_flush_queue ( q - > fq ) ;
2014-03-21 01:03:58 +04:00
return NULL ;
2005-04-17 02:20:36 +04:00
}
2011-11-23 13:59:13 +04:00
EXPORT_SYMBOL ( blk_init_allocated_queue ) ;
2005-04-17 02:20:36 +04:00
2011-12-14 03:33:38 +04:00
bool blk_get_queue ( struct request_queue * q )
2005-04-17 02:20:36 +04:00
{
2012-11-28 16:42:38 +04:00
if ( likely ( ! blk_queue_dying ( q ) ) ) {
2011-12-14 03:33:38 +04:00
__blk_get_queue ( q ) ;
return true ;
2005-04-17 02:20:36 +04:00
}
2011-12-14 03:33:38 +04:00
return false ;
2005-04-17 02:20:36 +04:00
}
2011-05-27 09:44:43 +04:00
EXPORT_SYMBOL ( blk_get_queue ) ;
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:59 +04:00
static inline void blk_free_request ( struct request_list * rl , struct request * rq )
2005-04-17 02:20:36 +04:00
{
2011-12-14 03:33:42 +04:00
if ( rq - > cmd_flags & REQ_ELVPRIV ) {
2012-06-05 07:40:59 +04:00
elv_put_request ( rl - > q , rq ) ;
2011-12-14 03:33:42 +04:00
if ( rq - > elv . icq )
2012-02-07 10:51:30 +04:00
put_io_context ( rq - > elv . icq - > ioc ) ;
2011-12-14 03:33:42 +04:00
}
2012-06-05 07:40:59 +04:00
mempool_free ( rq , rl - > rq_pool ) ;
2005-04-17 02:20:36 +04:00
}
/*
* ioc_batching returns true if the ioc is a valid batching request and
* should be given priority access to a request .
*/
2007-07-24 11:28:11 +04:00
static inline int ioc_batching ( struct request_queue * q , struct io_context * ioc )
2005-04-17 02:20:36 +04:00
{
if ( ! ioc )
return 0 ;
/*
* Make sure the process is able to allocate at least 1 request
* even if the batch times out , otherwise we could theoretically
* lose wakeups .
*/
return ioc - > nr_batch_requests = = q - > nr_batching | |
( ioc - > nr_batch_requests > 0
& & time_before ( jiffies , ioc - > last_waited + BLK_BATCH_TIME ) ) ;
}
/*
* ioc_set_batching sets ioc to be a new " batcher " if it is not one . This
* will cause the process to be a " batcher " on all queues in the system . This
* is the behaviour we want though - once it gets a wakeup it should be given
* a nice run .
*/
2007-07-24 11:28:11 +04:00
static void ioc_set_batching ( struct request_queue * q , struct io_context * ioc )
2005-04-17 02:20:36 +04:00
{
if ( ! ioc | | ioc_batching ( q , ioc ) )
return ;
ioc - > nr_batch_requests = q - > nr_batching ;
ioc - > last_waited = jiffies ;
}
2012-06-05 07:40:59 +04:00
static void __freed_request ( struct request_list * rl , int sync )
2005-04-17 02:20:36 +04:00
{
2012-06-05 07:40:59 +04:00
struct request_queue * q = rl - > q ;
2005-04-17 02:20:36 +04:00
2012-06-27 02:05:44 +04:00
/*
* bdi isn ' t aware of blkcg yet . As all async IOs end up root
* blkcg anyway , just use root blkcg state .
*/
if ( rl = = & q - > root_rl & &
rl - > count [ sync ] < queue_congestion_off_threshold ( q ) )
2009-04-06 16:48:01 +04:00
blk_clear_queue_congested ( q , sync ) ;
2005-04-17 02:20:36 +04:00
2009-04-06 16:48:01 +04:00
if ( rl - > count [ sync ] + 1 < = q - > nr_requests ) {
if ( waitqueue_active ( & rl - > wait [ sync ] ) )
wake_up ( & rl - > wait [ sync ] ) ;
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:59 +04:00
blk_clear_rl_full ( rl , sync ) ;
2005-04-17 02:20:36 +04:00
}
}
/*
* A request has just been released . Account for it , update the full and
* congestion status , wake up any waiters . Called under q - > queue_lock .
*/
2012-06-05 07:40:59 +04:00
static void freed_request ( struct request_list * rl , unsigned int flags )
2005-04-17 02:20:36 +04:00
{
2012-06-05 07:40:59 +04:00
struct request_queue * q = rl - > q ;
2011-10-19 16:31:22 +04:00
int sync = rw_is_sync ( flags ) ;
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:58 +04:00
q - > nr_rqs [ sync ] - - ;
2009-04-06 16:48:01 +04:00
rl - > count [ sync ] - - ;
2011-10-19 16:31:22 +04:00
if ( flags & REQ_ELVPRIV )
2012-06-05 07:40:58 +04:00
q - > nr_rqs_elvpriv - - ;
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:59 +04:00
__freed_request ( rl , sync ) ;
2005-04-17 02:20:36 +04:00
2009-04-06 16:48:01 +04:00
if ( unlikely ( rl - > starved [ sync ^ 1 ] ) )
2012-06-05 07:40:59 +04:00
__freed_request ( rl , sync ^ 1 ) ;
2005-04-17 02:20:36 +04:00
}
2014-05-20 21:49:02 +04:00
int blk_update_nr_requests ( struct request_queue * q , unsigned int nr )
{
struct request_list * rl ;
spin_lock_irq ( q - > queue_lock ) ;
q - > nr_requests = nr ;
blk_queue_congestion_threshold ( q ) ;
/* congestion isn't cgroup aware and follows root blkcg for now */
rl = & q - > root_rl ;
if ( rl - > count [ BLK_RW_SYNC ] > = queue_congestion_on_threshold ( q ) )
blk_set_queue_congested ( q , BLK_RW_SYNC ) ;
else if ( rl - > count [ BLK_RW_SYNC ] < queue_congestion_off_threshold ( q ) )
blk_clear_queue_congested ( q , BLK_RW_SYNC ) ;
if ( rl - > count [ BLK_RW_ASYNC ] > = queue_congestion_on_threshold ( q ) )
blk_set_queue_congested ( q , BLK_RW_ASYNC ) ;
else if ( rl - > count [ BLK_RW_ASYNC ] < queue_congestion_off_threshold ( q ) )
blk_clear_queue_congested ( q , BLK_RW_ASYNC ) ;
blk_queue_for_each_rl ( rl , q ) {
if ( rl - > count [ BLK_RW_SYNC ] > = q - > nr_requests ) {
blk_set_rl_full ( rl , BLK_RW_SYNC ) ;
} else {
blk_clear_rl_full ( rl , BLK_RW_SYNC ) ;
wake_up ( & rl - > wait [ BLK_RW_SYNC ] ) ;
}
if ( rl - > count [ BLK_RW_ASYNC ] > = q - > nr_requests ) {
blk_set_rl_full ( rl , BLK_RW_ASYNC ) ;
} else {
blk_clear_rl_full ( rl , BLK_RW_ASYNC ) ;
wake_up ( & rl - > wait [ BLK_RW_ASYNC ] ) ;
}
}
spin_unlock_irq ( q - > queue_lock ) ;
return 0 ;
}
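blk_update_nr_requests() above sets the congested state at or above the "on" threshold and clears it only below the lower "off" threshold, so counts between the two leave the state unchanged. A minimal sketch of that hysteresis follows. The names `toy_congestion` and `toy_congestion_update` are hypothetical, and the thresholds are plain fields here; how the kernel derives them from nr_requests is not shown in this section and is not modelled.

```c
#include <stdbool.h>

/* Hypothetical container for the two thresholds and the current state. */
struct toy_congestion {
	int on_thresh;		/* become congested at or above this count */
	int off_thresh;		/* become uncongested strictly below this count */
	bool congested;
};

/*
 * Mirror the if / else-if pair in blk_update_nr_requests(): counts that
 * fall between the thresholds keep whatever state was set before.
 */
void toy_congestion_update(struct toy_congestion *c, int count)
{
	if (count >= c->on_thresh)
		c->congested = true;
	else if (count < c->off_thresh)
		c->congested = false;
}
```

Keeping a gap between the two thresholds avoids flipping the congestion flag on every single allocation or free when the request count hovers near the limit.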
2011-02-11 13:05:46 +03:00
/*
* Determine if elevator data should be initialized when allocating the
* request associated with @ bio .
*/
static bool blk_rq_should_init_elevator ( struct bio * bio )
{
if ( ! bio )
return true ;
/*
* Flush requests do not use the elevator so skip initialization .
* This allows a request to share the flush and elevator data .
*/
if ( bio - > bi_rw & ( REQ_FLUSH | REQ_FUA ) )
return false ;
return true ;
}
2012-03-06 01:15:27 +04:00
/**
* rq_ioc - determine io_context for request allocation
* @ bio : request being allocated is for this bio ( can be % NULL )
*
* Determine io_context to use for request allocation for @ bio . May return
* % NULL if % current - > io_context doesn ' t exist .
*/
static struct io_context * rq_ioc ( struct bio * bio )
{
# ifdef CONFIG_BLK_CGROUP
if ( bio & & bio - > bi_ioc )
return bio - > bi_ioc ;
# endif
return current - > io_context ;
}
2011-10-19 16:33:05 +04:00
/**
2012-06-05 07:40:55 +04:00
* __get_request - get a free request
2012-06-05 07:40:59 +04:00
* @ rl : request list to allocate from
2011-10-19 16:33:05 +04:00
* @ rw_flags : RW and SYNC flags
* @ bio : bio to allocate request for ( can be % NULL )
* @ gfp_mask : allocation mask
*
* Get a free request from @ rl . This function may fail under memory
* pressure or if @ rl - > q is dying .
*
2014-08-28 18:15:21 +04:00
* Must be called with @ q - > queue_lock held and ,
* Returns ERR_PTR on failure , with @ q - > queue_lock held .
* Returns request pointer on success , with @ q - > queue_lock * not held * .
2005-04-17 02:20:36 +04:00
*/
2012-06-05 07:40:59 +04:00
static struct request * __get_request ( struct request_list * rl , int rw_flags ,
2012-06-05 07:40:55 +04:00
struct bio * bio , gfp_t gfp_mask )
2005-04-17 02:20:36 +04:00
{
2012-06-05 07:40:59 +04:00
struct request_queue * q = rl - > q ;
2012-03-06 01:15:23 +04:00
struct request * rq ;
2012-06-05 07:40:56 +04:00
struct elevator_type * et = q - > elevator - > type ;
struct io_context * ioc = rq_ioc ( bio ) ;
2011-12-14 03:33:42 +04:00
struct io_cq * icq = NULL ;
2009-04-06 16:48:01 +04:00
const bool is_sync = rw_is_sync ( rw_flags ) ! = 0 ;
2011-10-19 16:31:22 +04:00
int may_queue ;
2005-11-12 13:09:12 +03:00
2012-11-28 16:42:38 +04:00
if ( unlikely ( blk_queue_dying ( q ) ) )
2014-08-28 18:15:21 +04:00
return ERR_PTR ( - ENODEV ) ;
2011-10-19 16:33:05 +04:00
2006-12-13 15:02:26 +03:00
may_queue = elv_may_queue ( q , rw_flags ) ;
2005-11-12 13:09:12 +03:00
if ( may_queue = = ELV_MQUEUE_NO )
goto rq_starved ;
2009-04-06 16:48:01 +04:00
if ( rl - > count [ is_sync ] + 1 > = queue_congestion_on_threshold ( q ) ) {
if ( rl - > count [ is_sync ] + 1 > = q - > nr_requests ) {
2005-11-12 13:09:12 +03:00
/*
* The queue will fill after this allocation , so set
* it as full , and mark this process as " batching " .
* This process will be allowed to complete a batch of
* requests , others will be blocked .
*/
2012-06-05 07:40:59 +04:00
if ( ! blk_rl_full ( rl , is_sync ) ) {
2005-11-12 13:09:12 +03:00
ioc_set_batching ( q , ioc ) ;
2012-06-05 07:40:59 +04:00
blk_set_rl_full ( rl , is_sync ) ;
2005-11-12 13:09:12 +03:00
} else {
if ( may_queue ! = ELV_MQUEUE_MUST
& & ! ioc_batching ( q , ioc ) ) {
/*
* The queue is full and the allocating
* process is not a " batcher " , and not
* exempted by the IO scheduler
*/
2014-08-28 18:15:21 +04:00
return ERR_PTR ( - ENOMEM ) ;
2005-11-12 13:09:12 +03:00
}
}
2005-04-17 02:20:36 +04:00
}
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit harier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 02:05:44 +04:00
/*
* bdi isn ' t aware of blkcg yet . As all async IOs end up in the
* root blkcg anyway , just use root blkcg state .
*/
if ( rl = = & q - > root_rl )
blk_set_queue_congested ( q , is_sync ) ;
2005-04-17 02:20:36 +04:00
}
2005-06-28 18:35:11 +04:00
/*
* Only allow batching queuers to allocate up to 50 % over the defined
* limit of requests , otherwise we could have thousands of requests
* allocated with any setting of - > nr_requests
*/
2009-04-06 16:48:01 +04:00
if ( rl - > count [ is_sync ] > = ( 3 * q - > nr_requests / 2 ) )
2014-08-28 18:15:21 +04:00
return ERR_PTR ( - ENOMEM ) ;
2005-06-29 18:15:40 +04:00
2012-06-05 07:40:58 +04:00
q - > nr_rqs [ is_sync ] + + ;
2009-04-06 16:48:01 +04:00
rl - > count [ is_sync ] + + ;
rl - > starved [ is_sync ] = 0 ;
2005-10-28 10:29:39 +04:00
2011-12-14 03:33:42 +04:00
/*
* Decide whether the new request will be managed by elevator . If
* so , mark @ rw_flags and increment elvpriv . Non - zero elvpriv will
* prevent the current elevator from being destroyed until the new
* request is freed . This guarantees icq ' s won ' t be destroyed and
* makes creating new ones safe .
*
* Also , lookup icq while holding queue_lock . If it doesn ' t exist ,
* it will be created after releasing queue_lock .
*/
2012-03-06 01:14:58 +04:00
if ( blk_rq_should_init_elevator ( bio ) & & ! blk_queue_bypass ( q ) ) {
2011-10-19 16:31:22 +04:00
rw_flags | = REQ_ELVPRIV ;
2012-06-05 07:40:58 +04:00
q - > nr_rqs_elvpriv + + ;
2011-12-14 03:33:42 +04:00
if ( et - > icq_cache & & ioc )
icq = ioc_lookup_icq ( ioc , q ) ;
2011-02-11 13:05:46 +03:00
}
2005-10-28 10:29:39 +04:00
2010-10-25 00:06:02 +04:00
if ( blk_queue_io_stat ( q ) )
rw_flags | = REQ_IO_STAT ;
2005-04-17 02:20:36 +04:00
spin_unlock_irq ( q - > queue_lock ) ;
2012-04-20 03:29:21 +04:00
/* allocate and init request */
2012-06-05 07:40:59 +04:00
rq = mempool_alloc ( rl - > rq_pool , gfp_mask ) ;
2012-04-20 03:29:21 +04:00
if ( ! rq )
2012-03-06 01:15:23 +04:00
goto fail_alloc ;
2005-04-17 02:20:36 +04:00
2012-04-20 03:29:21 +04:00
blk_rq_init ( q , rq ) ;
2012-06-27 02:05:44 +04:00
blk_rq_set_rl ( rq , rl ) ;
2012-04-20 03:29:21 +04:00
rq - > cmd_flags = rw_flags | REQ_ALLOCED ;
2012-04-20 03:29:22 +04:00
/* init elvpriv */
2012-04-20 03:29:21 +04:00
if ( rw_flags & REQ_ELVPRIV ) {
2012-04-20 03:29:22 +04:00
if ( unlikely ( et - > icq_cache & & ! icq ) ) {
2012-06-05 07:40:56 +04:00
if ( ioc )
icq = ioc_create_icq ( ioc , q , gfp_mask ) ;
2012-04-20 03:29:22 +04:00
if ( ! icq )
goto fail_elvpriv ;
2012-04-20 03:29:21 +04:00
}
2012-04-20 03:29:22 +04:00
rq - > elv . icq = icq ;
if ( unlikely ( elv_set_request ( q , rq , bio , gfp_mask ) ) )
goto fail_elvpriv ;
/* @rq->elv.icq holds io_context until @rq is freed */
2012-04-20 03:29:21 +04:00
if ( icq )
get_io_context ( icq - > ioc ) ;
}
2012-04-20 03:29:22 +04:00
out :
2005-11-12 13:09:12 +03:00
/*
* ioc may be NULL here , and ioc_batching will be false . That ' s
* OK , if the queue is under the request limit then requests need
* not count toward the nr_batch_requests limit . There will always
* be some limit enforced by BLK_BATCH_TIME .
*/
2005-04-17 02:20:36 +04:00
if ( ioc_batching ( q , ioc ) )
ioc - > nr_batch_requests - - ;
2008-01-31 15:03:55 +03:00
2009-04-06 16:48:01 +04:00
trace_block_getrq ( q , bio , rw_flags & 1 ) ;
2005-04-17 02:20:36 +04:00
return rq ;
2012-03-06 01:15:23 +04:00
2012-04-20 03:29:22 +04:00
fail_elvpriv :
/*
* elvpriv init failed . ioc , icq and elvpriv aren ' t mempool backed
* and may fail indefinitely under memory pressure and thus
* shouldn ' t stall IO . Treat this request as ! elvpriv . This will
* disturb iosched and blkcg but weird is better than dead .
*/
printk_ratelimited ( KERN_WARNING " %s: request aux data allocation failed, iosched may be disturbed \n " ,
dev_name ( q - > backing_dev_info . dev ) ) ;
rq - > cmd_flags & = ~ REQ_ELVPRIV ;
rq - > elv . icq = NULL ;
spin_lock_irq ( q - > queue_lock ) ;
2012-06-05 07:40:58 +04:00
q - > nr_rqs_elvpriv - - ;
2012-04-20 03:29:22 +04:00
spin_unlock_irq ( q - > queue_lock ) ;
goto out ;
2012-03-06 01:15:23 +04:00
fail_alloc :
/*
* Allocation failed presumably due to memory . Undo anything we
* might have messed up .
*
* Allocating task should really be put onto the front of the wait
* queue , but this is pretty rare .
*/
spin_lock_irq ( q - > queue_lock ) ;
2012-06-05 07:40:59 +04:00
freed_request ( rl , rw_flags ) ;
2012-03-06 01:15:23 +04:00
/*
* In the very unlikely event that allocation failed and no
* requests for this direction were pending , mark us starved so that
* freeing of a request in the other direction will notice
* us . Another possible fix would be to split the rq mempool into
* READ and WRITE .
*/
rq_starved :
if ( unlikely ( rl - > count [ is_sync ] = = 0 ) )
rl - > starved [ is_sync ] = 1 ;
2014-08-28 18:15:21 +04:00
return ERR_PTR ( - ENOMEM ) ;
2005-04-17 02:20:36 +04:00
}
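The limit checks in __get_request() above can be modelled in userspace as a small pure function. This is a simplified sketch, not kernel code: the names admit_request and enum admit are hypothetical, and the model leaves out the congestion thresholds, the elevator's ELV_MQUEUE_NO answer, and the side effect of marking the first caller to fill the list as a batcher.

```c
#include <stdbool.h>

/* Hypothetical outcome names for this sketch; not kernel identifiers. */
enum admit { ADMIT_OK, ADMIT_DENY_FULL, ADMIT_DENY_CAP };

/*
 * Simplified model of the limit checks in __get_request().
 *   count:       requests already allocated in this direction
 *   nr_requests: the queue's configured limit
 *   was_full:    the request_list was already marked full
 *   batching:    the caller is currently a "batcher"
 *   must_queue:  the IO scheduler answered ELV_MQUEUE_MUST
 */
static enum admit admit_request(unsigned int count, unsigned int nr_requests,
				bool was_full, bool batching, bool must_queue)
{
	if (count + 1 >= nr_requests) {
		/* A caller that finds the list already full needs an exemption. */
		if (was_full && !must_queue && !batching)
			return ADMIT_DENY_FULL;
	}

	/* Even exempt callers are stopped at 150% of the limit. */
	if (count >= 3 * nr_requests / 2)
		return ADMIT_DENY_CAP;

	return ADMIT_OK;
}
```

With nr_requests set to 128, for instance, a batcher may keep allocating past 128 outstanding requests, but the second check stops every caller at 192.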
2011-10-19 16:33:05 +04:00
/**
2012-06-05 07:40:55 +04:00
* get_request - get a free request
2011-10-19 16:33:05 +04:00
* @ q : request_queue to allocate request from
* @ rw_flags : RW and SYNC flags
* @ bio : bio to allocate request for ( can be % NULL )
2012-06-05 07:40:55 +04:00
* @ gfp_mask : allocation mask
2011-10-19 16:33:05 +04:00
*
2012-06-05 07:40:55 +04:00
* Get a free request from @ q . If % __GFP_WAIT is set in @ gfp_mask , this
* function keeps retrying under memory pressure and fails iff @ q is dead .
2005-06-29 07:45:14 +04:00
*
2014-08-28 18:15:21 +04:00
* Must be called with @ q - > queue_lock held .
* Returns ERR_PTR on failure , with @ q - > queue_lock held .
* Returns request pointer on success , with @ q - > queue_lock * not held * .
2005-04-17 02:20:36 +04:00
*/
2012-06-05 07:40:55 +04:00
static struct request * get_request ( struct request_queue * q , int rw_flags ,
struct bio * bio , gfp_t gfp_mask )
2005-04-17 02:20:36 +04:00
{
2009-04-06 16:48:01 +04:00
const bool is_sync = rw_is_sync ( rw_flags ) ! = 0 ;
2012-06-05 07:40:55 +04:00
DEFINE_WAIT ( wait ) ;
2012-06-27 02:05:44 +04:00
struct request_list * rl ;
2005-04-17 02:20:36 +04:00
struct request * rq ;
2012-06-27 02:05:44 +04:00
rl = blk_get_rl ( q , bio ) ; /* transferred to @rq on success */
2012-06-05 07:40:55 +04:00
retry :
2012-06-27 02:05:44 +04:00
rq = __get_request ( rl , rw_flags , bio , gfp_mask ) ;
2014-08-28 18:15:21 +04:00
if ( ! IS_ERR ( rq ) )
2012-06-05 07:40:55 +04:00
return rq ;
2005-04-17 02:20:36 +04:00
2012-11-28 16:42:38 +04:00
if ( ! ( gfp_mask & __GFP_WAIT ) | | unlikely ( blk_queue_dying ( q ) ) ) {
2012-06-27 02:05:44 +04:00
blk_put_rl ( rl ) ;
2014-08-28 18:15:21 +04:00
return rq ;
2012-06-27 02:05:44 +04:00
}
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:55 +04:00
/* wait on @rl and retry */
prepare_to_wait_exclusive ( & rl - > wait [ is_sync ] , & wait ,
TASK_UNINTERRUPTIBLE ) ;
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:55 +04:00
trace_block_sleeprq ( q , bio , rw_flags & 1 ) ;
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:55 +04:00
spin_unlock_irq ( q - > queue_lock ) ;
io_schedule ( ) ;
2005-06-29 07:45:14 +04:00
2012-06-05 07:40:55 +04:00
/*
* After sleeping , we become a " batching " process and will be able
* to allocate at least one request , and up to a big batch of them
* for a small period of time . See ioc_batching , ioc_set_batching .
*/
ioc_set_batching ( q , current - > io_context ) ;
2008-05-22 17:13:29 +04:00
2012-06-05 07:40:55 +04:00
spin_lock_irq ( q - > queue_lock ) ;
finish_wait ( & rl - > wait [ is_sync ] , & wait ) ;
2005-04-17 02:20:36 +04:00
2012-06-05 07:40:55 +04:00
goto retry ;
2005-04-17 02:20:36 +04:00
}
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
static struct request * blk_old_get_request ( struct request_queue * q , int rw ,
gfp_t gfp_mask )
2005-04-17 02:20:36 +04:00
{
struct request * rq ;
BUG_ON ( rw ! = READ & & rw ! = WRITE ) ;
2012-06-05 07:40:56 +04:00
/* create ioc upfront */
create_io_context ( gfp_mask , q - > node ) ;
2005-06-29 07:45:14 +04:00
spin_lock_irq ( q - > queue_lock ) ;
2012-06-05 07:40:55 +04:00
rq = get_request ( q , rw , NULL , gfp_mask ) ;
2014-08-28 18:15:21 +04:00
if ( IS_ERR ( rq ) )
2011-10-19 16:33:05 +04:00
spin_unlock_irq ( q - > queue_lock ) ;
2005-06-29 07:45:14 +04:00
/* q->queue_lock is unlocked at this point */
2005-04-17 02:20:36 +04:00
return rq ;
}
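The failure paths above return ERR_PTR ( - ENOMEM ) rather than NULL, and blk_old_get_request() tests the result with IS_ERR. The underlying idea, a negative errno value encoded in the topmost addresses of the pointer range, can be sketched in userspace as follows. The model_* names are hypothetical stand-ins written for this illustration; the kernel's own helpers are declared in include/linux/err.h, and the 4095 bound is assumed to mirror its MAX_ERRNO.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest errno value the encoding reserves room for (assumed to match the kernel). */
#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the last MODEL_MAX_ERRNO addresses of the address space. */
static inline bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-(intptr_t)MODEL_MAX_ERRNO;
}

/* Hypothetical allocator standing in for get_request(). */
static void *model_get_request(bool out_of_memory)
{
	static int dummy_request;

	if (out_of_memory)
		return model_err_ptr(-ENOMEM);
	return &dummy_request;
}
```

A caller distinguishes the two outcomes with model_is_err() and, on failure, reads the reason back with model_ptr_err(). Note that a NULL pointer is not an error pointer under this scheme, which is why the callers in this file check IS_ERR ( rq ) instead of ! rq.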
2013-10-24 12:20:05 +04:00
struct request * blk_get_request ( struct request_queue * q , int rw , gfp_t gfp_mask )
{
if ( q - > mq_ops )
2014-05-27 22:59:46 +04:00
return blk_mq_alloc_request ( q , rw , gfp_mask , false ) ;
2013-10-24 12:20:05 +04:00
	else
		return blk_old_get_request(q, rw, gfp_mask);
}
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL(blk_get_request);
2006-07-20 16:54:05 +04:00
/**
2009-05-17 19:57:15 +04:00
 * blk_make_request - given a bio, allocate a corresponding struct request.
2009-06-12 07:00:41 +04:00
 * @q: target request queue
2009-05-17 19:57:15 +04:00
 * @bio: The bio describing the memory mappings that will be submitted for IO.
 *       It may be a chained bio properly constructed by the block/bio layer.
2009-06-12 07:00:41 +04:00
 * @gfp_mask: gfp flags to be used for memory allocation
2006-07-20 16:54:05 +04:00
 *
2009-05-17 19:57:15 +04:00
 * blk_make_request is the parallel of generic_make_request for BLOCK_PC
 * type commands, where the struct request needs to be further initialized by
 * the caller. It is passed a &struct bio, which describes the memory info of
 * the I/O transfer.
2006-07-20 16:54:05 +04:00
 *
2009-05-17 19:57:15 +04:00
 * The caller of blk_make_request must make sure that bi_io_vec
 * is set to describe the memory buffers, and that bio_data_dir() will return
 * the needed direction of the request. (All bios in the passed bio chain
 * must be set up accordingly.)
 *
 * If called under non-sleepable conditions, mapped bio buffers must not
 * need bouncing, by calling the appropriate masked or flagged allocator,
 * suitable for the target device. Otherwise the call to blk_queue_bounce will
 * BUG.
2009-05-19 21:52:35 +04:00
 *
 * WARNING: When allocating/cloning a bio chain, careful consideration should be
 * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
 * anything but the first bio in the chain. Otherwise you risk waiting for IO
 * completion of a bio that hasn't been submitted yet, thus resulting in a
 * deadlock. Alternatively, bios should be allocated using bio_kmalloc() instead
 * of bio_alloc(), as that avoids the mempool deadlock.
 * If possible, a big IO should be split into smaller parts when allocation
 * fails. Partial allocation should not be an error, or you risk a live-lock.
2006-07-20 16:54:05 +04:00
 */
2009-05-17 19:57:15 +04:00
struct request *blk_make_request(struct request_queue *q, struct bio *bio,
				 gfp_t gfp_mask)
2006-07-20 16:54:05 +04:00
{
2009-05-17 19:57:15 +04:00
	struct request *rq = blk_get_request(q, bio_data_dir(bio), gfp_mask);
2014-08-28 18:15:21 +04:00
	if (IS_ERR(rq))
		return rq;
2009-05-17 19:57:15 +04:00
2014-06-06 17:57:37 +04:00
	blk_rq_set_block_pc(rq);
2009-05-17 19:57:15 +04:00
	for_each_bio(bio) {
		struct bio *bounce_bio = bio;
		int ret;
		blk_queue_bounce(q, &bounce_bio);
		ret = blk_rq_append_bio(q, rq, bounce_bio);
		if (unlikely(ret)) {
			blk_put_request(rq);
			return ERR_PTR(ret);
		}
	}
	return rq;
2006-07-20 16:54:05 +04:00
}
2009-05-17 19:57:15 +04:00
EXPORT_SYMBOL(blk_make_request);
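blk_make_request reports failure through the returned pointer itself (IS_ERR / ERR_PTR) rather than through NULL. The following is a minimal userspace sketch of that convention, not the kernel's own definitions: the macros are simplified stand-ins, and demo_request, demo_get_request and demo_make_request are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption: the top MAX_ERRNO addresses never hold a valid object,
 * so a negative errno can be stored in the pointer value itself. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct request. */
struct demo_request {
	int nr_bios;
};

/* Hypothetical allocator: returns an encoded -ENOMEM when told to fail. */
struct demo_request *demo_get_request(int fail)
{
	static struct demo_request rq;

	if (fail)
		return ERR_PTR(-ENOMEM);
	rq.nr_bios = 0;
	return &rq;
}

/* Mirrors the shape of blk_make_request: pass an allocation error
 * straight through to the caller, otherwise finish initialization. */
struct demo_request *demo_make_request(int fail, int nr_bios)
{
	struct demo_request *rq = demo_get_request(fail);

	if (IS_ERR(rq))
		return rq;
	rq->nr_bios = nr_bios;
	return rq;
}
```

A caller checks the result with IS_ERR() and recovers the error code with PTR_ERR(); testing the result against NULL would miss the failure.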
2006-07-20 16:54:05 +04:00
2014-06-06 17:57:37 +04:00
/**
 * blk_rq_set_block_pc - initialize a request to type BLOCK_PC
 * @rq: request to be initialized
 *
 */
void blk_rq_set_block_pc(struct request *rq)
{
	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	rq->__data_len = 0;
	rq->__sector = (sector_t) -1;
	rq->bio = rq->biotail = NULL;
	memset(rq->__cmd, 0, sizeof(rq->__cmd));
}
EXPORT_SYMBOL(blk_rq_set_block_pc);
2005-04-17 02:20:36 +04:00
/**
 * blk_requeue_request - put a request back on queue
 * @q: request queue where request should be inserted
 * @rq: request to be inserted
 *
 * Description:
 *    Drivers often keep queueing requests until the hardware cannot accept
 *    more. When that happens, the request must be put back on the queue.
 *    Must be called with the queue lock held.
 */
2007-07-24 11:28:11 +04:00
void blk_requeue_request(struct request_queue *q, struct request *rq)
2005-04-17 02:20:36 +04:00
{
2008-09-14 16:55:09 +04:00
	blk_delete_timer(rq);
	blk_clear_rq_complete(rq);
2008-10-30 10:34:33 +03:00
	trace_block_rq_requeue(q, rq);
2006-03-23 22:00:26 +03:00
2005-04-17 02:20:36 +04:00
	if (blk_rq_tagged(rq))
		blk_queue_end_tag(q, rq);
2009-05-27 16:17:08 +04:00
	BUG_ON(blk_queued_rq(rq));
2005-04-17 02:20:36 +04:00
	elv_requeue_request(q, rq);
}
EXPORT_SYMBOL(blk_requeue_request);
2011-03-08 15:19:51 +03:00
static void add_acct_request(struct request_queue *q, struct request *rq,
			     int where)
{
2013-10-24 12:20:05 +04:00
	blk_account_io_start(rq, true);
2011-03-10 10:52:07 +03:00
	__elv_add_request(q, rq, where);
2011-03-08 15:19:51 +03:00
}
2008-08-25 14:56:14 +04:00
static void part_round_stats_single(int cpu, struct hd_struct *part,
				    unsigned long now)
{
2014-05-10 01:48:23 +04:00
	int inflight;
2008-08-25 14:56:14 +04:00
	if (now == part->stamp)
		return;
2014-05-10 01:48:23 +04:00
	inflight = part_in_flight(part);
	if (inflight) {
2008-08-25 14:56:14 +04:00
		__part_stat_add(cpu, part, time_in_queue,
2014-05-10 01:48:23 +04:00
				inflight * (now - part->stamp));
2008-08-25 14:56:14 +04:00
		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
	}
	part->stamp = now;
}
/**
2008-10-16 09:46:23 +04:00
 * part_round_stats() - Round off the performance stats on a struct disk_stats.
 * @cpu: cpu number for stats access
 * @part: target partition
2005-04-17 02:20:36 +04:00
 *
 * The average IO queue length and utilisation statistics are maintained
 * by observing the current state of the queue length and the amount of
 * time it has been in this state for.
 *
 * Normally, that accounting is done on IO completion, but that can result
 * in more than a second's worth of IO being accounted for within any one
 * second, leading to >100% utilisation. To deal with that, we call this
 * function to do a round-off before returning the results when reading
 * /proc/diskstats. This accounts immediately for all queue usage up to
 * the current jiffies and restarts the counters again.
 */
2008-08-25 14:47:21 +04:00
void part_round_stats(int cpu, struct hd_struct *part)
2008-02-08 13:04:35 +03:00
{
	unsigned long now = jiffies;
2008-08-25 14:56:14 +04:00
	if (part->partno)
		part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
	part_round_stats_single(cpu, part, now);
2008-02-08 13:04:35 +03:00
}
2008-08-25 14:56:14 +04:00
EXPORT_SYMBOL_GPL(part_round_stats);
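The round-off described above can be modelled in userspace as follows. This is a simplified sketch, not the kernel code: struct demo_part and demo_round_stats are hypothetical names, the per-cpu counters are collapsed into plain fields, and time is an arbitrary tick count supplied by the caller instead of jiffies.

```c
#include <assert.h>

/* Hypothetical, single-threaded stand-in for the per-partition stats. */
struct demo_part {
	unsigned long stamp;         /* tick of the last round-off */
	unsigned long in_flight;     /* requests currently outstanding */
	unsigned long time_in_queue; /* sum over requests of ticks spent queued */
	unsigned long io_ticks;      /* ticks during which the device was busy */
};

/* Account for everything that happened between p->stamp and now,
 * then restart the interval at now. */
void demo_round_stats(struct demo_part *p, unsigned long now)
{
	unsigned long delta;

	if (now == p->stamp)
		return;                 /* nothing elapsed, nothing to add */

	delta = now - p->stamp;
	if (p->in_flight) {
		/* every in-flight request waited for the whole interval */
		p->time_in_queue += p->in_flight * delta;
		/* the device counts as busy once, however deep the queue */
		p->io_ticks += delta;
	}
	p->stamp = now;
}
```

With three requests in flight over ten ticks, time_in_queue grows by 30 while io_ticks grows by 10, which is why utilisation derived from io_ticks cannot exceed the elapsed time once the round-off has run.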
2008-02-08 13:04:35 +03:00
2013-03-23 07:42:27 +04:00
#ifdef CONFIG_PM_RUNTIME
static void blk_pm_put_request(struct request *rq)
{
	if (rq->q->dev && !(rq->cmd_flags & REQ_PM) && !--rq->q->nr_pending)
		pm_runtime_mark_last_busy(rq->q->dev);
}
#else
static inline void blk_pm_put_request(struct request *rq) {}
#endif
2005-04-17 02:20:36 +04:00
/*
 * queue lock must be held
 */
2007-07-24 11:28:11 +04:00
void __blk_put_request(struct request_queue *q, struct request *req)
2005-04-17 02:20:36 +04:00
{
	if (unlikely(!q))
		return;
2014-02-07 22:22:37 +04:00
	if (q->mq_ops) {
		blk_mq_free_request(req);
		return;
	}
2013-03-23 07:42:27 +04:00
	blk_pm_put_request(req);
2005-10-20 18:23:44 +04:00
	elv_completed_request(q, req);
2009-03-24 14:35:07 +03:00
	/* this is a bio leak */
	WARN_ON(req->bio != NULL);
2005-04-17 02:20:36 +04:00
	/*
	 * Request may not have originated from ll_rw_blk. if not,
	 * it didn't come out of our reserved rq pools
	 */
2006-08-10 10:59:11 +04:00
	if (req->cmd_flags & REQ_ALLOCED) {
2011-10-19 16:31:22 +04:00
		unsigned int flags = req->cmd_flags;
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 02:05:44 +04:00
		struct request_list *rl = blk_rq_rl(req);
2005-04-17 02:20:36 +04:00
		BUG_ON(!list_empty(&req->queuelist));
2014-04-10 06:27:01 +04:00
		BUG_ON(ELV_ON_HASH(req));
2005-04-17 02:20:36 +04:00
2012-06-27 02:05:44 +04:00
		blk_free_request(rl, req);
		freed_request(rl, flags);
		blk_put_rl(rl);
2005-04-17 02:20:36 +04:00
	}
}
2005-11-11 14:30:24 +03:00
EXPORT_SYMBOL_GPL(__blk_put_request);
2005-04-17 02:20:36 +04:00
void blk_put_request(struct request *req)
{
2007-07-24 11:28:11 +04:00
	struct request_queue *q = req->q;
2005-10-20 18:23:44 +04:00
2013-10-24 12:20:05 +04:00
	if (q->mq_ops)
		blk_mq_free_request(req);
	else {
		unsigned long flags;
		spin_lock_irqsave(q->queue_lock, flags);
		__blk_put_request(q, req);
		spin_unlock_irqrestore(q->queue_lock, flags);
	}
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL(blk_put_request);
2010-06-18 18:59:42 +04:00
/**
 * blk_add_request_payload - add a payload to a request
 * @rq: request to update
 * @page: page backing the payload
 * @len: length of the payload.
 *
 * This allows a block driver to add a payload to an already submitted
 * request at a later point. The driver needs to take care of freeing the
 * payload itself.
 *
 * Note that this is a quite horrible hack and nothing but handling of
 * discard requests should ever use it.
 */
void blk_add_request_payload(struct request *rq, struct page *page,
			     unsigned int len)
{
	struct bio *bio = rq->bio;
	bio->bi_io_vec->bv_page = page;
	bio->bi_io_vec->bv_offset = 0;
	bio->bi_io_vec->bv_len = len;
2013-10-12 02:44:27 +04:00
	bio->bi_iter.bi_size = len;
2010-06-18 18:59:42 +04:00
	bio->bi_vcnt = 1;
	bio->bi_phys_segments = 1;
	rq->__data_len = rq->resid_len = len;
	rq->nr_phys_segments = 1;
}
EXPORT_SYMBOL_GPL(blk_add_request_payload);
2013-10-24 12:20:05 +04:00
bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
			    struct bio *bio)
2011-03-08 15:19:51 +03:00
{
	const int ff = bio->bi_rw & REQ_FAILFAST_MASK;
	if (!ll_back_merge_fn(q, req, bio))
		return false;
2013-01-12 01:06:34 +04:00
	trace_block_bio_backmerge(q, req, bio);
2011-03-08 15:19:51 +03:00
	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
		blk_rq_set_mixed_merge(req);
	req->biotail->bi_next = bio;
	req->biotail = bio;
2013-10-12 02:44:27 +04:00
	req->__data_len += bio->bi_iter.bi_size;
2011-03-08 15:19:51 +03:00
	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
2013-10-24 12:20:05 +04:00
	blk_account_io_start(req, false);
2011-03-08 15:19:51 +03:00
	return true;
}
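The list manipulation in bio_attempt_back_merge can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: demo_bio, demo_req and demo_back_merge are hypothetical names, and a plain "bio starts where the request ends" test stands in for ll_back_merge_fn, which in the kernel checks queue limits rather than contiguity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SECTOR_SIZE 512u

/* Hypothetical, pared-down bio: start sector, byte length, failfast bits. */
struct demo_bio {
	unsigned long sector;
	unsigned int size;
	unsigned int failfast;
	struct demo_bio *next;
};

/* Hypothetical, pared-down request holding a singly linked bio chain. */
struct demo_req {
	unsigned long sector;     /* first sector covered by the request */
	unsigned int data_len;    /* total bytes across all chained bios */
	unsigned int failfast;
	bool mixed_merge;         /* set when bios disagree on failfast */
	struct demo_bio *bio;     /* head of the chain */
	struct demo_bio *biotail; /* tail of the chain */
};

/* Append bio to the tail of req if it starts exactly where req ends. */
bool demo_back_merge(struct demo_req *req, struct demo_bio *bio)
{
	unsigned long end = req->sector + req->data_len / DEMO_SECTOR_SIZE;

	/* assumption: contiguity is the only merge criterion here */
	if (bio->sector != end)
		return false;

	/* differing failfast settings mark the request as a mixed merge */
	if (req->failfast != bio->failfast)
		req->mixed_merge = true;

	req->biotail->next = bio;
	req->biotail = bio;
	req->data_len += bio->size;
	return true;
}
```

Keeping a tail pointer makes the append constant time regardless of how many bios the request already carries; only the data length needs updating because the starting sector does not change on a back merge.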
2013-10-24 12:20:05 +04:00
bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
			     struct bio *bio)
2011-03-08 15:19:51 +03:00
{
	const int ff = bio->bi_rw & REQ_FAILFAST_MASK;
	if (!ll_front_merge_fn(q, req, bio))
		return false;
2013-01-12 01:06:34 +04:00
	trace_block_bio_frontmerge(q, req, bio);
2011-03-08 15:19:51 +03:00
	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
		blk_rq_set_mixed_merge(req);
	bio->bi_next = req->bio;
	req->bio = bio;
2013-10-12 02:44:27 +04:00
	req->__sector = bio->bi_iter.bi_sector;
	req->__data_len += bio->bi_iter.bi_size;
2011-03-08 15:19:51 +03:00
	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
2013-10-24 12:20:05 +04:00
	blk_account_io_start(req, false);
2011-03-08 15:19:51 +03:00
	return true;
}
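The bookkeeping that a front merge performs on the request can be sketched in isolation as below. The structures are hypothetical, simplified stand-ins (not the kernel's `struct request` or `struct bio`), and the sketch leaves out the failfast, tracing, priority and accounting steps of the function above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for struct bio and struct request. */
struct sketch_bio {
	struct sketch_bio *next;
	uint64_t sector;	/* start sector */
	uint32_t size;		/* length in bytes */
};

struct sketch_req {
	struct sketch_bio *bio;	/* head of the bio chain */
	uint64_t sector;	/* start sector of the whole request */
	uint32_t data_len;	/* total length in bytes */
};

/* Prepend bio to the request: the request now starts where bio starts. */
static void sketch_front_merge(struct sketch_req *req, struct sketch_bio *bio)
{
	bio->next = req->bio;
	req->bio = bio;
	req->sector = bio->sector;
	req->data_len += bio->size;
}
```

After the merge the request covers the new bio followed by everything it held before, so only the start sector and the total length change.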
2011-10-19 16:33:08 +04:00
/**
2013-10-24 12:20:05 +04:00
 * blk_attempt_plug_merge - try to merge with %current's plugged list
2011-10-19 16:33:08 +04:00
 * @q: request_queue new bio is being queued at
 * @bio: new bio being queued
 * @request_count: out parameter for number of traversed plugged requests
 *
 * Determine whether @bio being queued on @q can be merged with a request
 * on %current's plugged list.  Returns %true if merge was successful,
 * otherwise %false.
 *
block: don't call elevator callbacks for plug merges
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.
For regular merges, bypass ensures elvpriv count to reach zero, which
in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
from forced back insertion. Plug merge doesn't check ELVPRIV, and, as
the requests haven't gone through elevator insertion yet, it doesn't
have SOFTBARRIER set allowing merges on a bypassed queue.
This, for example, leads to the following crash during elevator
switch.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
PGD 112cbc067 PUD 115d5c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: deadline_iosched
Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
Stack:
ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
Call Trace:
[<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
[<ffffffff81391bf1>] elv_try_merge+0x21/0x40
[<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
[<ffffffff81396a5a>] generic_make_request+0xca/0x100
[<ffffffff81396b04>] submit_bio+0x74/0x100
[<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
[<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff811986b2>] do_sync_read+0xe2/0x120
[<ffffffff81199345>] vfs_read+0xc5/0x180
[<ffffffff81199501>] sys_read+0x51/0x90
[<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
There are multiple ways to fix this including making plug merge check
ELVPRIV; however,
* Calling into elevator outside queue lock is confusing and
error-prone.
* Requests on plug list aren't known to the elevator. They aren't on
the elevator yet, so there's no elevator specific state to update.
* Given the nature of plug merges - collecting bio's for the same
purpose from the same issuer - elevator specific restrictions aren't
applicable.
So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Note that this makes per-cgroup merged stats skip plug merging.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 12:19:42 +04:00
 * Plugging coalesces IOs from the same issuer for the same purpose without
 * going through @q->queue_lock.  As such it's more of an issuing mechanism
 * than scheduling, and the request, while it may have elvpriv data, is not
 * added on the elevator at this point.  In addition, we don't have
 * reliable access to the elevator outside queue lock.  Only check basic
 * merging parameters without querying the elevator.
2014-05-21 01:46:26 +04:00
 *
 * Caller must ensure !blk_queue_nomerges(q) beforehand.
2011-03-08 15:19:51 +03:00
 */
2013-10-24 12:20:05 +04:00
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
			    unsigned int *request_count)
2011-03-08 15:19:51 +03:00
{
	struct blk_plug *plug;
	struct request *rq;
	bool ret = false;
2013-10-29 22:01:03 +04:00
	struct list_head *plug_list;
2011-03-08 15:19:51 +03:00
2011-10-19 16:33:08 +04:00
	plug = current->plug;
2011-03-08 15:19:51 +03:00
	if (!plug)
		goto out;
2011-08-24 18:04:34 +04:00
	*request_count = 0;
2011-03-08 15:19:51 +03:00
2013-10-29 22:01:03 +04:00
	if (q->mq_ops)
		plug_list = &plug->mq_list;
	else
		plug_list = &plug->list;
	list_for_each_entry_reverse(rq, plug_list, queuelist) {
2011-03-08 15:19:51 +03:00
		int el_ret;
2012-04-06 21:37:47 +04:00
		if (rq->q == q)
			(*request_count)++;
2011-08-24 18:04:34 +04:00
2012-02-08 12:19:42 +04:00
		if (rq->q != q || !blk_rq_merge_ok(rq, bio))
2011-03-08 15:19:51 +03:00
			continue;
2012-02-08 12:19:38 +04:00
		el_ret = blk_try_merge(rq, bio);
2011-03-08 15:19:51 +03:00
		if (el_ret == ELEVATOR_BACK_MERGE) {
			ret = bio_attempt_back_merge(q, rq, bio);
			if (ret)
				break;
		} else if (el_ret == ELEVATOR_FRONT_MERGE) {
			ret = bio_attempt_front_merge(q, rq, bio);
			if (ret)
				break;
		}
	}
out:
	return ret;
}
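The position check that decides between a back merge and a front merge, with no elevator involved, can be sketched as below. The types and names are hypothetical userspace stand-ins; the sketch assumes that merging depends only on sector adjacency and ignores the other eligibility checks (same queue, compatible flags) done by the loop above.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_merge {
	SKETCH_NO_MERGE,
	SKETCH_FRONT_MERGE,
	SKETCH_BACK_MERGE,
};

/* Hypothetical description of a contiguous range of sectors. */
struct sketch_extent {
	uint64_t sector;	/* first sector */
	uint32_t nr_sectors;	/* length in sectors */
};

/* Decide how bio could be attached to rq based on adjacency alone. */
static enum sketch_merge sketch_try_merge(const struct sketch_extent *rq,
					  const struct sketch_extent *bio)
{
	/* bio starts exactly where the request ends: append it. */
	if (rq->sector + rq->nr_sectors == bio->sector)
		return SKETCH_BACK_MERGE;

	/* bio ends exactly where the request starts: prepend it. */
	if (bio->sector + bio->nr_sectors == rq->sector)
		return SKETCH_FRONT_MERGE;

	return SKETCH_NO_MERGE;
}
```

Because the decision uses only data already in the request and the bio, it is safe to make without holding the queue lock, which is the property the plug merge path relies on.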
2008-01-29 16:53:40 +03:00
void init_request_from_bio(struct request *req, struct bio *bio)
2006-01-06 11:49:58 +03:00
{
2006-08-10 10:44:47 +04:00
	req->cmd_type = REQ_TYPE_FS;
2006-01-06 11:49:58 +03:00
2010-08-07 20:20:39 +04:00
	req->cmd_flags |= bio->bi_rw & REQ_COMMON_MASK;
	if (bio->bi_rw & REQ_RAHEAD)
2009-07-03 12:48:16 +04:00
		req->cmd_flags |= REQ_FAILFAST_MASK;
2006-06-13 10:26:10 +04:00
2006-01-06 11:49:58 +03:00
	req->errors = 0;
2013-10-12 02:44:27 +04:00
	req->__sector = bio->bi_iter.bi_sector;
2006-01-06 11:49:58 +03:00
	req->ioprio = bio_prio(bio);
2007-08-16 15:31:30 +04:00
	blk_rq_bio_prep(req->q, req, bio);
2006-01-06 11:49:58 +03:00
}
2011-09-12 14:12:01 +04:00
void blk_queue_bio(struct request_queue *q, struct bio *bio)
2005-04-17 02:20:36 +04:00
{
2010-08-12 16:31:06 +04:00
	const bool sync = !!(bio->bi_rw & REQ_SYNC);
2011-03-08 15:19:51 +03:00
	struct blk_plug *plug;
	int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
	struct request *req;
2011-08-24 18:04:34 +04:00
	unsigned int request_count = 0;
2005-04-17 02:20:36 +04:00
	/*
	 * low level driver can indicate that it wants pages above a
	 * certain limit bounced to low memory (ie for highmem, or even
	 * ISA dma in theory)
	 */
	blk_queue_bounce(q, &bio);
2013-02-22 04:42:55 +04:00
	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
		bio_endio(bio, -EIO);
		return;
	}
2010-09-03 13:56:17 +04:00
	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
2011-03-08 15:19:51 +03:00
		spin_lock_irq(q->queue_lock);
2011-01-25 14:43:54 +03:00
		where = ELEVATOR_INSERT_FLUSH;
2010-09-03 13:56:16 +04:00
		goto get_rq;
	}
2011-03-08 15:19:51 +03:00
	/*
	 * Check if we can merge with the plugged list before grabbing
	 * any locks.
	 */
2014-05-21 01:46:26 +04:00
	if (!blk_queue_nomerges(q) &&
	    blk_attempt_plug_merge(q, bio, &request_count))
2011-09-12 14:12:01 +04:00
		return;
2005-04-17 02:20:36 +04:00
2011-03-08 15:19:51 +03:00
	spin_lock_irq(q->queue_lock);
2006-03-23 22:00:26 +03:00
2011-03-08 15:19:51 +03:00
	el_ret = elv_merge(q, &req, bio);
	if (el_ret == ELEVATOR_BACK_MERGE) {
		if (bio_attempt_back_merge(q, req, bio)) {
2012-02-08 12:19:42 +04:00
			elv_bio_merged(q, req, bio);
2011-03-08 15:19:51 +03:00
			if (!attempt_back_merge(q, req))
				elv_merged_request(q, req, el_ret);
			goto out_unlock;
		}
	} else if (el_ret == ELEVATOR_FRONT_MERGE) {
		if (bio_attempt_front_merge(q, req, bio)) {
2012-02-08 12:19:42 +04:00
			elv_bio_merged(q, req, bio);
2011-03-08 15:19:51 +03:00
			if (!attempt_front_merge(q, req))
				elv_merged_request(q, req, el_ret);
			goto out_unlock;
2009-07-03 12:48:17 +04:00
		}
2005-04-17 02:20:36 +04:00
	}
2005-06-29 07:45:13 +04:00
get_rq:
2006-12-13 15:02:26 +03:00
	/*
	 * This sync check and mask will be re-done in init_request_from_bio(),
	 * but we need to set it earlier to expose the sync flag to the
	 * rq allocator and io schedulers.
	 */
	rw_flags = bio_data_dir(bio);
	if (sync)
2010-08-07 20:20:39 +04:00
		rw_flags |= REQ_SYNC;
2006-12-13 15:02:26 +03:00
2005-04-17 02:20:36 +04:00
	/*
2005-06-29 07:45:13 +04:00
	 * Grab a free request. This might sleep but can not fail.
2005-06-29 07:45:14 +04:00
	 * Returns with the queue unlocked.
2005-06-29 07:45:13 +04:00
	 */
2012-06-05 07:40:55 +04:00
	req = get_request(q, rw_flags, bio, GFP_NOIO);
2014-08-28 18:15:21 +04:00
	if (IS_ERR(req)) {
		bio_endio(bio, PTR_ERR(req));	/* @q is dead */
2011-10-19 16:33:05 +04:00
		goto out_unlock;
	}
2005-06-29 07:45:14 +04:00
2005-06-29 07:45:13 +04:00
	/*
	 * After dropping the lock and possibly sleeping here, our request
	 * may now be mergeable after it had proven unmergeable (above).
	 * We don't worry about that case for efficiency. It won't happen
	 * often, and the elevators are able to handle it.
2005-04-17 02:20:36 +04:00
	 */
2006-01-06 11:49:58 +03:00
	init_request_from_bio(req, bio);
2005-04-17 02:20:36 +04:00
2011-10-24 18:11:30 +04:00
	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
2011-07-26 17:01:15 +04:00
		req->cpu = raw_smp_processor_id();
2011-03-08 15:19:51 +03:00
	plug = current->plug;
2011-03-09 13:56:30 +03:00
	if (plug) {
2011-04-12 12:28:28 +04:00
		/*
		 * If this is the first request added after a plug, fire
2013-09-11 23:21:07 +04:00
		 * off a plug trace.
2011-04-12 12:28:28 +04:00
		 */
2013-09-11 23:21:07 +04:00
		if (!request_count)
2011-04-12 12:28:28 +04:00
			trace_block_plug(q);
2011-11-16 12:21:50 +04:00
		else {
2011-11-16 12:21:50 +04:00
			if (request_count >= BLK_MAX_REQUEST_COUNT) {
2011-11-16 12:21:50 +04:00
				blk_flush_plug_list(plug, false);
2011-11-16 12:21:50 +04:00
				trace_block_plug(q);
			}
2011-03-08 15:19:51 +03:00
		}
		list_add_tail(&req->queuelist, &plug->list);
2013-10-24 12:20:05 +04:00
blk_account_io_start ( req , true ) ;
2011-03-08 15:19:51 +03:00
} else {
spin_lock_irq ( q - > queue_lock ) ;
add_acct_request ( q , req , where ) ;
2011-04-18 13:41:33 +04:00
__blk_run_queue ( q ) ;
2011-03-08 15:19:51 +03:00
out_unlock :
spin_unlock_irq ( q - > queue_lock ) ;
}
2005-04-17 02:20:36 +04:00
}
2011-09-12 14:03:37 +04:00
EXPORT_SYMBOL_GPL ( blk_queue_bio ) ; /* for device mapper only */
2005-04-17 02:20:36 +04:00
/*
* If bio - > bi_bdev is a partition , remap the location
*/
static inline void blk_partition_remap ( struct bio * bio )
{
struct block_device * bdev = bio - > bi_bdev ;
2007-09-27 15:01:25 +04:00
if ( bio_sectors ( bio ) & & bdev ! = bdev - > bd_contains ) {
2005-04-17 02:20:36 +04:00
struct hd_struct * p = bdev - > bd_part ;
2013-10-12 02:44:27 +04:00
bio - > bi_iter . bi_sector + = p - > start_sect ;
2005-04-17 02:20:36 +04:00
bio - > bi_bdev = bdev - > bd_contains ;
2007-08-07 17:30:23 +04:00
2010-11-16 14:52:38 +03:00
trace_block_bio_remap ( bdev_get_queue ( bio - > bi_bdev ) , bio ,
bdev - > bd_dev ,
2013-10-12 02:44:27 +04:00
bio - > bi_iter . bi_sector - p - > start_sect ) ;
2005-04-17 02:20:36 +04:00
}
}
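/*
 * Illustrative sketch only, not kernel code: a user-space model of the
 * remap step above. struct demo_part, struct demo_rbio and
 * demo_partition_remap ( ) are hypothetical names standing in for
 * struct hd_struct , struct bio and blk_partition_remap ( ) . It assumes
 * the same rule as the function above: a non-empty bio that targets a
 * partition has the partition start added to its sector exactly once,
 * after which it addresses the whole disk.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the partition descriptor. */
struct demo_part {
	uint64_t start_sect;	/* first sector of the partition on the disk */
};

/* Hypothetical stand-in for the fields of a bio that the remap touches. */
struct demo_rbio {
	uint64_t sector;	/* partition-relative before, disk-relative after */
	unsigned int nr_sectors;
	int on_partition;	/* models bdev != bdev->bd_contains */
};

/* Shift a partition-relative sector to a whole-disk sector, at most once. */
static void demo_partition_remap(struct demo_rbio *bio,
				 const struct demo_part *p)
{
	if (bio->nr_sectors && bio->on_partition) {
		bio->sector += p->start_sect;
		bio->on_partition = 0;	/* now addressed against the whole disk */
	}
}
```

/*
 * Calling demo_partition_remap ( ) a second time on the same bio is a
 * no-op in this model, which mirrors why the real function also
 * retargets bi_bdev at bd_contains .
 */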
static void handle_bad_sector ( struct bio * bio )
{
char b [ BDEVNAME_SIZE ] ;
printk ( KERN_INFO " attempt to access beyond end of device \n " ) ;
printk ( KERN_INFO " %s: rw=%ld, want=%Lu, limit=%Lu \n " ,
bdevname ( bio - > bi_bdev , b ) ,
bio - > bi_rw ,
2012-09-26 02:05:12 +04:00
( unsigned long long ) bio_end_sector ( bio ) ,
2010-11-08 16:39:12 +03:00
( long long ) ( i_size_read ( bio - > bi_bdev - > bd_inode ) > > 9 ) ) ;
2005-04-17 02:20:36 +04:00
set_bit ( BIO_EOF , & bio - > bi_flags ) ;
}
2006-12-08 13:39:46 +03:00
# ifdef CONFIG_FAIL_MAKE_REQUEST
static DECLARE_FAULT_ATTR ( fail_make_request ) ;
static int __init setup_fail_make_request ( char * str )
{
return setup_fault_attr ( & fail_make_request , str ) ;
}
__setup ( " fail_make_request= " , setup_fail_make_request ) ;
2011-07-27 03:09:03 +04:00
static bool should_fail_request ( struct hd_struct * part , unsigned int bytes )
2006-12-08 13:39:46 +03:00
{
2011-07-27 03:09:03 +04:00
return part - > make_it_fail & & should_fail ( & fail_make_request , bytes ) ;
2006-12-08 13:39:46 +03:00
}
static int __init fail_make_request_debugfs ( void )
{
2011-08-04 03:21:01 +04:00
struct dentry * dir = fault_create_debugfs_attr ( " fail_make_request " ,
NULL , & fail_make_request ) ;
2014-04-11 11:58:56 +04:00
return PTR_ERR_OR_ZERO ( dir ) ;
2006-12-08 13:39:46 +03:00
}
late_initcall ( fail_make_request_debugfs ) ;
# else /* CONFIG_FAIL_MAKE_REQUEST */
2011-07-27 03:09:03 +04:00
static inline bool should_fail_request ( struct hd_struct * part ,
unsigned int bytes )
2006-12-08 13:39:46 +03:00
{
2011-07-27 03:09:03 +04:00
return false ;
2006-12-08 13:39:46 +03:00
}
# endif /* CONFIG_FAIL_MAKE_REQUEST */
2007-07-18 15:27:58 +04:00
/*
* Check whether this bio extends beyond the end of the device .
*/
static inline int bio_check_eod ( struct bio * bio , unsigned int nr_sectors )
{
sector_t maxsector ;
if ( ! nr_sectors )
return 0 ;
/* Test device or partition size, when known. */
2010-11-08 16:39:12 +03:00
maxsector = i_size_read ( bio - > bi_bdev - > bd_inode ) > > 9 ;
2007-07-18 15:27:58 +04:00
if ( maxsector ) {
2013-10-12 02:44:27 +04:00
sector_t sector = bio - > bi_iter . bi_sector ;
2007-07-18 15:27:58 +04:00
if ( maxsector < nr_sectors | | maxsector - nr_sectors < sector ) {
/*
* This may well happen - the kernel calls bread ( )
* without checking the size of the device , e . g . , when
* mounting a device .
*/
handle_bad_sector ( bio ) ;
return 1 ;
}
}
return 0 ;
}
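/*
 * Illustrative sketch only, not kernel code: the range test used by
 * bio_check_eod ( ) , lifted into a user-space helper. demo_beyond_eod ( )
 * and its plain uint64_t parameters are assumptions made for the
 * example. The point it shows is why the test subtracts
 * ( maxsector - nr_sectors < sector ) instead of adding
 * ( sector + nr_sectors > maxsector ) : the addition can wrap around for
 * very large sector numbers and wrongly accept the request.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if [sector, sector + nr_sectors) runs past maxsector, else 0. */
static int demo_beyond_eod(uint64_t sector, uint64_t nr_sectors,
			   uint64_t maxsector)
{
	if (!nr_sectors)
		return 0;	/* an empty I/O never crosses the end */
	if (!maxsector)
		return 0;	/* size unknown: nothing to test against */

	/* Subtract instead of add: sector + nr_sectors could wrap around. */
	return maxsector < nr_sectors || maxsector - nr_sectors < sector;
}
```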
2011-09-15 16:01:40 +04:00
static noinline_for_stack bool
generic_make_request_checks ( struct bio * bio )
2005-04-17 02:20:36 +04:00
{
2007-07-24 11:28:11 +04:00
struct request_queue * q ;
2011-09-12 14:12:01 +04:00
int nr_sectors = bio_sectors ( bio ) ;
2007-11-02 10:49:08 +03:00
int err = - EIO ;
2011-09-12 14:12:01 +04:00
char b [ BDEVNAME_SIZE ] ;
struct hd_struct * part ;
2005-04-17 02:20:36 +04:00
might_sleep ( ) ;
2007-07-18 15:27:58 +04:00
if ( bio_check_eod ( bio , nr_sectors ) )
goto end_io ;
2005-04-17 02:20:36 +04:00
2011-09-12 14:12:01 +04:00
q = bdev_get_queue ( bio - > bi_bdev ) ;
if ( unlikely ( ! q ) ) {
printk ( KERN_ERR
" generic_make_request: Trying to access "
" nonexistent block-device %s (%Lu) \n " ,
bdevname ( bio - > bi_bdev , b ) ,
2013-10-12 02:44:27 +04:00
( long long ) bio - > bi_iter . bi_sector ) ;
2011-09-12 14:12:01 +04:00
goto end_io ;
}
2006-12-08 13:39:46 +03:00
2012-09-18 20:19:25 +04:00
if ( unlikely ( bio_is_rw ( bio ) & &
nr_sectors > queue_max_hw_sectors ( q ) ) ) {
2011-09-12 14:12:01 +04:00
printk ( KERN_ERR " bio too big device %s (%u > %u) \n " ,
bdevname ( bio - > bi_bdev , b ) ,
bio_sectors ( bio ) ,
queue_max_hw_sectors ( q ) ) ;
goto end_io ;
}
2005-04-17 02:20:36 +04:00
2011-09-12 14:12:01 +04:00
part = bio - > bi_bdev - > bd_part ;
2013-10-12 02:44:27 +04:00
if ( should_fail_request ( part , bio - > bi_iter . bi_size ) | |
2011-09-12 14:12:01 +04:00
should_fail_request ( & part_to_disk ( part ) - > part0 ,
2013-10-12 02:44:27 +04:00
bio - > bi_iter . bi_size ) )
2011-09-12 14:12:01 +04:00
goto end_io ;
2006-03-23 22:00:26 +03:00
2011-09-12 14:12:01 +04:00
/*
* If this device has partitions , remap block n
* of partition p to block n + start ( p ) of the disk .
*/
blk_partition_remap ( bio ) ;
2006-03-23 22:00:26 +03:00
2011-09-12 14:12:01 +04:00
if ( bio_check_eod ( bio , nr_sectors ) )
goto end_io ;
2010-09-03 13:56:17 +04:00
2011-09-12 14:12:01 +04:00
/*
* Filter flush bio ' s early so that make_request based
* drivers without flush support don ' t have to worry
* about them .
*/
if ( ( bio - > bi_rw & ( REQ_FLUSH | REQ_FUA ) ) & & ! q - > flush_flags ) {
bio - > bi_rw & = ~ ( REQ_FLUSH | REQ_FUA ) ;
if ( ! nr_sectors ) {
err = 0 ;
2007-11-02 10:49:08 +03:00
goto end_io ;
}
2011-09-12 14:12:01 +04:00
}
2006-10-31 09:07:21 +03:00
2011-09-12 14:12:01 +04:00
if ( ( bio - > bi_rw & REQ_DISCARD ) & &
( ! blk_queue_discard ( q ) | |
2012-09-18 20:19:25 +04:00
( ( bio - > bi_rw & REQ_SECURE ) & & ! blk_queue_secdiscard ( q ) ) ) ) {
2011-09-12 14:12:01 +04:00
err = - EOPNOTSUPP ;
goto end_io ;
}
2009-09-08 23:56:38 +04:00
2012-09-18 20:19:27 +04:00
if ( bio - > bi_rw & REQ_WRITE_SAME & & ! bdev_write_same ( bio - > bi_bdev ) ) {
2011-09-12 14:12:01 +04:00
err = - EOPNOTSUPP ;
goto end_io ;
}
2009-09-08 23:56:38 +04:00
2012-06-05 07:40:56 +04:00
/*
* Various block parts want % current - > io_context and lazy ioc
* allocation ends up trading a lot of pain for a small amount of
* memory . Just allocate it upfront . This may fail and block
* layer knows how to live with it .
*/
create_io_context ( GFP_ATOMIC , q - > node ) ;
2011-10-19 16:33:01 +04:00
if ( blk_throtl_bio ( q , bio ) )
return false ; /* throttled, will be resubmitted later */
2011-09-15 16:01:40 +04:00
2011-09-12 14:12:01 +04:00
trace_block_bio_queue ( q , bio ) ;
2011-09-15 16:01:40 +04:00
return true ;
2008-11-28 07:32:03 +03:00
end_io :
bio_endio ( bio , err ) ;
2011-09-15 16:01:40 +04:00
return false ;
2005-04-17 02:20:36 +04:00
}
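/*
 * Illustrative sketch only, not kernel code: the flush filtering step
 * of generic_make_request_checks ( ) as a user-space function. The
 * DEMO_* flag values, enum demo_verdict and demo_filter_flush ( ) are
 * hypothetical; the real REQ_FLUSH / REQ_FUA bits have different
 * values. It assumes the behaviour seen above: when the queue
 * advertises no flush support, the flush bits are stripped, and a bio
 * that carried no data is completed successfully right away.
 */

```c
#include <assert.h>

/* Hypothetical flag bits; the kernel's REQ_* values differ. */
#define DEMO_FLUSH 0x1u
#define DEMO_FUA   0x2u

enum demo_verdict {
	DEMO_SUBMIT,		/* pass the bio on to the driver */
	DEMO_COMPLETE_OK	/* nothing left to do, end it with status 0 */
};

static enum demo_verdict demo_filter_flush(unsigned int *rw,
					   unsigned int nr_sectors,
					   unsigned int queue_flush_flags)
{
	if ((*rw & (DEMO_FLUSH | DEMO_FUA)) && !queue_flush_flags) {
		/* The driver cannot flush, so hide the request from it. */
		*rw &= ~(DEMO_FLUSH | DEMO_FUA);
		if (!nr_sectors)
			return DEMO_COMPLETE_OK;	/* empty flush */
	}
	return DEMO_SUBMIT;
}
```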
2011-09-15 16:01:40 +04:00
/**
* generic_make_request - hand a buffer to its device driver for I / O
* @ bio : The bio describing the location in memory and on the device .
*
* generic_make_request ( ) is used to make I / O requests of block
* devices . It is passed a & struct bio , which describes the I / O that needs
* to be done .
*
* generic_make_request ( ) does not return any status . The
* success / failure status of the request , along with notification of
* completion , is delivered asynchronously through the bio - > bi_end_io
* function described ( one day ) elsewhere .
*
* The caller of generic_make_request must make sure that bi_io_vec
* is set to describe the memory buffer , that bi_bdev and bi_sector are
* set to describe the device address , and that
* bi_end_io and optionally bi_private are set to describe how
* completion notification should be signaled .
*
* generic_make_request and the drivers it calls may use bi_next if this
* bio happens to be merged with someone else , and may resubmit the bio to
* a lower device by calling into generic_make_request recursively , which
* means the bio should NOT be touched after the call to - > make_request_fn .
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
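The scheme described in this commit message can be modelled in user space.
The following is a minimal sketch under stated assumptions, not the kernel
implementation: demo_bio, demo_list, demo_active, demo_submit and
demo_make_request are hypothetical names standing in for struct bio,
struct bio_list, current->bio_list, generic_make_request and a stacked
driver's make_request_fn. Each "driver" passes the bio down one level by
resubmitting it; the resubmission is only queued, so the call depth stays
at one no matter how many layers are stacked.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bio {
	struct demo_bio *next;	// plays the role of bi_next
	int level;		// 0 = bottom device, >0 = layers stacked above it
};

struct demo_list {
	struct demo_bio *head, *tail;
};

static struct demo_list *demo_active;	// stands in for current->bio_list
static int demo_depth, demo_max_depth, demo_handled;

static void demo_submit(struct demo_bio *bio);

// A "stacked driver": handle the bio, then hand it to the layer below.
static void demo_make_request(struct demo_bio *bio)
{
	demo_depth++;
	if (demo_depth > demo_max_depth)
		demo_max_depth = demo_depth;
	demo_handled++;
	if (bio->level > 0) {
		bio->level--;
		demo_submit(bio);	// looks recursive, but only queues
	}
	demo_depth--;
}

static void demo_submit(struct demo_bio *bio)
{
	struct demo_list on_stack = { NULL, NULL };

	if (demo_active) {
		// A submit is already running in this thread: defer the bio.
		bio->next = NULL;
		if (demo_active->tail)
			demo_active->tail->next = bio;
		else
			demo_active->head = bio;
		demo_active->tail = bio;
		return;
	}

	demo_active = &on_stack;
	do {
		demo_make_request(bio);
		bio = on_stack.head;	// pop the next deferred bio, if any
		if (bio) {
			on_stack.head = bio->next;
			if (!on_stack.head)
				on_stack.tail = NULL;
			bio->next = NULL;	// clear before the real call
		}
	} while (bio);
	demo_active = NULL;	// deactivate
}
```

Because the pending list lives on the stack of the outermost call and is
reachable only through a per-thread pointer, the model needs no locking,
which matches the reasoning given above.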
2007-05-01 11:53:42 +04:00
*/
void generic_make_request ( struct bio * bio )
{
2010-02-23 10:55:42 +03:00
struct bio_list bio_list_on_stack ;
2011-09-15 16:01:40 +04:00
if ( ! generic_make_request_checks ( bio ) )
return ;
/*
* We only want one - > make_request_fn to be active at a time , else
* stack usage with stacked devices could be a problem . So use
* current - > bio_list to keep a list of requests submitted by a
* make_request_fn function . current - > bio_list is also used as a
* flag to say if generic_make_request is currently active in this
* task or not . If it is NULL , then no make_request is active . If
* it is non - NULL , then a make_request is active , and new requests
* should be added at the tail .
*/
2010-02-23 10:55:42 +03:00
if ( current - > bio_list ) {
bio_list_add ( current - > bio_list , bio ) ;
2007-05-01 11:53:42 +04:00
return ;
}
2011-09-15 16:01:40 +04:00
2007-05-01 11:53:42 +04:00
/* following loop may be a bit non-obvious, and so deserves some
* explanation .
* Before entering the loop , bio - > bi_next is NULL ( as all callers
* ensure that ) so we have a list with a single bio .
* We pretend that we have just taken it off a longer list , so
2010-02-23 10:55:42 +03:00
* we assign bio_list to a pointer to the bio_list_on_stack ,
* thus initialising the bio_list of new bios to be
2011-09-15 16:01:40 +04:00
* added . - > make_request ( ) may indeed add some more bios
2007-05-01 11:53:42 +04:00
* through a recursive call to generic_make_request . If it
* did , we find a non - NULL value in bio_list and re - enter the loop
* from the top . In this case we really did just take the bio
2010-02-23 10:55:42 +03:00
* off the top of the list ( no pretending ) and so remove it from
2011-09-15 16:01:40 +04:00
* bio_list , and call into - > make_request ( ) again .
2007-05-01 11:53:42 +04:00
*/
BUG_ON ( bio - > bi_next ) ;
2010-02-23 10:55:42 +03:00
bio_list_init ( & bio_list_on_stack ) ;
current - > bio_list = & bio_list_on_stack ;
2007-05-01 11:53:42 +04:00
do {
2011-09-15 16:01:40 +04:00
struct request_queue * q = bdev_get_queue ( bio - > bi_bdev ) ;
q - > make_request_fn ( q , bio ) ;
2010-02-23 10:55:42 +03:00
bio = bio_list_pop ( current - > bio_list ) ;
2007-05-01 11:53:42 +04:00
} while ( bio ) ;
2010-02-23 10:55:42 +03:00
current - > bio_list = NULL ; /* deactivate */
2007-05-01 11:53:42 +04:00
}
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( generic_make_request ) ;
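The commit message attached to generic_make_request describes turning recursive submission into iteration: while a submission is active, nested calls only link the bio into a pending list, and the outer call drains that list. The following is a minimal userspace sketch of that idea, not the kernel code. Every `toy_` name is hypothetical, and a plain static stands in for the per-thread list that the kernel keeps on the current task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio; only what the sketch needs. */
struct toy_bio {
	int layers_below;       /* stacked layers still to traverse */
	struct toy_bio *next;   /* plays the role of bi_next */
};

struct toy_bio_list {
	struct toy_bio *head, *tail;
};

/*
 * Plays the role of the per-thread pending list. A plain static is an
 * assumption that only holds for this single-threaded sketch.
 */
static struct toy_bio_list *toy_active_list;

static int toy_handled;      /* how many times a driver saw a bio */
static int toy_nesting;      /* current depth of toy_make_request() */
static int toy_max_nesting;  /* deepest nesting observed */

static void toy_submit(struct toy_bio *bio);

/* Stand-in for a stacked driver's make_request_fn: remap and resubmit. */
static void toy_make_request(struct toy_bio *bio)
{
	if (++toy_nesting > toy_max_nesting)
		toy_max_nesting = toy_nesting;
	toy_handled++;
	if (bio->layers_below > 0) {
		bio->layers_below--;
		toy_submit(bio);    /* looks recursive, but only queues */
	}
	toy_nesting--;
}

static void toy_submit(struct toy_bio *bio)
{
	struct toy_bio_list pending = { NULL, NULL };

	assert(bio != NULL);

	if (toy_active_list) {
		/* A submission is already active: link the bio and return. */
		bio->next = NULL;
		if (toy_active_list->tail)
			toy_active_list->tail->next = bio;
		else
			toy_active_list->head = bio;
		toy_active_list->tail = bio;
		return;
	}

	toy_active_list = &pending;
	do {
		toy_make_request(bio);
		/* Pop the next pending bio, clearing its link before use. */
		bio = pending.head;
		if (bio) {
			pending.head = bio->next;
			if (!pending.head)
				pending.tail = NULL;
			bio->next = NULL;
		}
	} while (bio);
	toy_active_list = NULL;     /* deactivate */
}
```

Submitting a bio with three layers below it makes the toy driver run four times, yet `toy_make_request()` is never nested more than one level deep, which is the stack-saving property the commit message is after.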
/**
2008-08-19 22:13:11 +04:00
* submit_bio - submit a bio to the block device layer for I / O
2005-04-17 02:20:36 +04:00
* @ rw : whether to % READ or % WRITE , or maybe to % READA ( read ahead )
* @ bio : The & struct bio which describes the I / O
*
* submit_bio ( ) is very similar in purpose to generic_make_request ( ) , and
* uses that function to do most of the work . Both are fairly rough
2008-08-19 22:13:11 +04:00
* interfaces ; @ bio must be presetup and ready for I / O .
2005-04-17 02:20:36 +04:00
*
*/
void submit_bio ( int rw , struct bio * bio )
{
2005-06-27 12:55:12 +04:00
bio - > bi_rw | = rw ;
2005-04-17 02:20:36 +04:00
2007-09-27 15:01:25 +04:00
/*
* If it ' s a regular read / write or a barrier with data attached ,
* go through the normal accounting stuff before submission .
*/
2012-09-18 20:19:25 +04:00
if ( bio_has_data ( bio ) ) {
2012-09-18 20:19:27 +04:00
unsigned int count ;
if ( unlikely ( rw & REQ_WRITE_SAME ) )
count = bdev_logical_block_size ( bio - > bi_bdev ) > > 9 ;
else
count = bio_sectors ( bio ) ;
2007-09-27 15:01:25 +04:00
if ( rw & WRITE ) {
count_vm_events ( PGPGOUT , count ) ;
} else {
2013-10-12 02:44:27 +04:00
task_io_account_read ( bio - > bi_iter . bi_size ) ;
2007-09-27 15:01:25 +04:00
count_vm_events ( PGPGIN , count ) ;
}
if ( unlikely ( block_dump ) ) {
char b [ BDEVNAME_SIZE ] ;
2010-09-14 10:48:01 +04:00
printk ( KERN_DEBUG " %s(%d): %s block %Lu on %s (%u sectors) \n " ,
2007-10-19 10:40:40 +04:00
current - > comm , task_pid_nr ( current ) ,
2007-09-27 15:01:25 +04:00
( rw & WRITE ) ? " WRITE " : " READ " ,
2013-10-12 02:44:27 +04:00
( unsigned long long ) bio - > bi_iter . bi_sector ,
2010-09-14 10:48:01 +04:00
bdevname ( bio - > bi_bdev , b ) ,
count ) ;
2007-09-27 15:01:25 +04:00
}
2005-04-17 02:20:36 +04:00
}
generic_make_request ( bio ) ;
}
EXPORT_SYMBOL ( submit_bio ) ;
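The accounting branch in submit_bio picks the sector count in one of two ways: one logical block for a REQ_WRITE_SAME bio, otherwise the bio size in 512-byte sectors. A small userspace sketch of that arithmetic follows. `toy_accounted_sectors` and its parameters are hypothetical names for illustration, with `bio_bytes` standing in for `bi_iter.bi_size`.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SECTOR_SHIFT 9   /* 512-byte sectors, matching the ">> 9" above */

/*
 * Hypothetical helper mirroring the count selection in submit_bio():
 * a WRITE SAME bio is accounted as one logical block, anything else
 * by its size in sectors.
 */
static unsigned int toy_accounted_sectors(bool write_same,
					   unsigned int logical_block_size,
					   unsigned int bio_bytes)
{
	if (write_same)
		return logical_block_size >> TOY_SECTOR_SHIFT;
	return bio_bytes >> TOY_SECTOR_SHIFT;
}
```

With a 4096-byte logical block and a 1 MiB bio, the WRITE SAME case yields 8 sectors while the ordinary case yields 2048.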
2008-09-18 18:45:38 +04:00
/**
* blk_rq_check_limits - Helper function to check a request for the queue limit
* @ q : the queue
* @ rq : the request being checked
*
* Description :
* @ rq may have been made based on weaker limitations of upper - level queues
* in request stacking drivers , and it may violate the limitation of @ q .
* Since the block layer and the underlying device driver trust @ rq
* after it is inserted to @ q , it should be checked against @ q before
* the insertion using this generic function .
*
* This function should also be useful for request stacking drivers
2010-08-06 23:11:15 +04:00
* in some cases below , so export this function .
2008-09-18 18:45:38 +04:00
* Request stacking drivers like request - based dm may change the queue
* limits while requests are in the queue ( e . g . dm ' s table swapping ) .
2014-02-18 17:54:36 +04:00
* Such request stacking drivers should check those requests against
2008-09-18 18:45:38 +04:00
* the new queue limits again when they dispatch those requests ,
* although such checks are also done against the old queue limits
* when submitting requests .
*/
int blk_rq_check_limits ( struct request_queue * q , struct request * rq )
{
2012-09-18 20:19:25 +04:00
if ( ! rq_mergeable ( rq ) )
2010-08-08 20:11:33 +04:00
return 0 ;
2012-09-18 20:19:26 +04:00
if ( blk_rq_sectors ( rq ) > blk_queue_get_max_sectors ( q , rq - > cmd_flags ) ) {
2008-09-18 18:45:38 +04:00
printk ( KERN_ERR " %s: over max size limit. \n " , __func__ ) ;
return - EIO ;
}
/*
* queue ' s settings related to segment counting like q - > bounce_pfn
* may differ from that of other stacking queues .
* Recalculate it to check the request correctly on this queue ' s
* limitation .
*/
blk_recalc_rq_segments ( rq ) ;
2010-02-26 08:20:39 +03:00
if ( rq - > nr_phys_segments > queue_max_segments ( q ) ) {
2008-09-18 18:45:38 +04:00
printk ( KERN_ERR " %s: over max segments limit. \n " , __func__ ) ;
return - EIO ;
}
return 0 ;
}
EXPORT_SYMBOL_GPL ( blk_rq_check_limits ) ;
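The situation blk_rq_check_limits guards against can be shown with a request built for a permissive upper queue and then checked against a stricter lower one. This is a rough userspace sketch with hypothetical `toy_` types. It assumes the segment count has already been recomputed for the target queue, which the real function does via blk_recalc_rq_segments().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a queue's limits. */
struct toy_queue_limits {
	unsigned int max_sectors;
	unsigned int max_segments;
};

/* Hypothetical, reduced view of a request. */
struct toy_request {
	unsigned int sectors;
	unsigned int segments;  /* assumed already recounted for this queue */
	bool mergeable;
};

/* Returns 0 if rq fits within q's limits, -EIO otherwise. */
static int toy_check_limits(const struct toy_queue_limits *q,
			    const struct toy_request *rq)
{
	/* Mirrors the early return for non-mergeable requests above. */
	if (!rq->mergeable)
		return 0;
	if (rq->sectors > q->max_sectors)
		return -EIO;
	if (rq->segments > q->max_segments)
		return -EIO;
	return 0;
}
```

A stacking driver in this sketch would run the check again at dispatch time if the lower queue's limits may have changed since the request was built.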
/**
* blk_insert_cloned_request - Helper for stacking drivers to submit a request
* @ q : the queue to submit the request
* @ rq : the request being queued
*/
int blk_insert_cloned_request ( struct request_queue * q , struct request * rq )
{
unsigned long flags ;
block: fix flush machinery for stacking drivers with differing flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
makes it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-15 23:37:25 +04:00
int where = ELEVATOR_INSERT_BACK ;
2008-09-18 18:45:38 +04:00
if ( blk_rq_check_limits ( q , rq ) )
return - EIO ;
2011-07-27 03:09:03 +04:00
if ( rq - > rq_disk & &
should_fail_request ( & rq - > rq_disk - > part0 , blk_rq_bytes ( rq ) ) )
2008-09-18 18:45:38 +04:00
return - EIO ;
spin_lock_irqsave ( q - > queue_lock , flags ) ;
2012-11-28 16:42:38 +04:00
if ( unlikely ( blk_queue_dying ( q ) ) ) {
2011-12-14 03:33:37 +04:00
spin_unlock_irqrestore ( q - > queue_lock , flags ) ;
return - ENODEV ;
}
2008-09-18 18:45:38 +04:00
/*
* Submitting request must be dequeued before calling this function
* because it will be linked to another request_queue
*/
BUG_ON ( blk_queued_rq ( rq ) ) ;
2011-08-15 23:37:25 +04:00
if ( rq - > cmd_flags & ( REQ_FLUSH | REQ_FUA ) )
where = ELEVATOR_INSERT_FLUSH ;
add_acct_request ( q , rq , where ) ;
2011-10-17 14:57:23 +04:00
if ( where = = ELEVATOR_INSERT_FLUSH )
__blk_run_queue ( q ) ;
2008-09-18 18:45:38 +04:00
spin_unlock_irqrestore ( q - > queue_lock , flags ) ;
return 0 ;
}
EXPORT_SYMBOL_GPL ( blk_insert_cloned_request ) ;
2009-07-03 12:48:17 +04:00
/**
* blk_rq_err_bytes - determine number of bytes till the next failure boundary
* @ rq : request to examine
*
* Description :
* A request could be a merge of IOs which require different failure
* handling . This function determines the number of bytes which
* can be failed from the beginning of the request without
* crossing into an area which needs to be retried further .
*
* Return :
* The number of bytes to fail .
*
* Context :
* queue_lock must be held .
*/
unsigned int blk_rq_err_bytes ( const struct request * rq )
{
unsigned int ff = rq - > cmd_flags & REQ_FAILFAST_MASK ;
unsigned int bytes = 0 ;
struct bio * bio ;
if ( ! ( rq - > cmd_flags & REQ_MIXED_MERGE ) )
return blk_rq_bytes ( rq ) ;
/*
* Currently the only ' mixing ' which can happen is between
* different failfast types . We can safely fail portions
* which have all the failfast bits that the first one has -
* the ones which are at least as eager to fail as the first
* one .
*/
for ( bio = rq - > bio ; bio ; bio = bio - > bi_next ) {
if ( ( bio - > bi_rw & ff ) ! = ff )
break ;
2013-10-12 02:44:27 +04:00
bytes + = bio - > bi_iter . bi_size ;
2009-07-03 12:48:17 +04:00
}
/* this could lead to infinite loop */
BUG_ON ( blk_rq_bytes ( rq ) & & ! bytes ) ;
return bytes ;
}
EXPORT_SYMBOL_GPL ( blk_rq_err_bytes ) ;
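The loop in blk_rq_err_bytes sums the leading bios that carry every failfast bit the request itself carries and stops at the first one that does not. Below is a userspace sketch of that prefix computation. The `TOY_` flag values and `struct toy_bio` are hypothetical, and the sketch takes the reference bits from the first bio, whereas the kernel reads them from the request's cmd_flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the real REQ_FAILFAST_* bits differ. */
#define TOY_FAILFAST_DEV        0x1u
#define TOY_FAILFAST_TRANSPORT  0x2u
#define TOY_FAILFAST_DRIVER     0x4u
#define TOY_FAILFAST_MASK       0x7u

struct toy_bio {
	unsigned int flags;           /* plays the role of bi_rw */
	unsigned int size;            /* plays the role of bi_iter.bi_size */
	const struct toy_bio *next;   /* plays the role of bi_next */
};

/*
 * Number of bytes, counted from the front of the chain, whose bios are
 * at least as eager to fail as the first one.
 */
static unsigned int toy_err_bytes(const struct toy_bio *first)
{
	unsigned int ff, bytes = 0;
	const struct toy_bio *bio;

	if (!first)
		return 0;

	ff = first->flags & TOY_FAILFAST_MASK;
	for (bio = first; bio; bio = bio->next) {
		if ((bio->flags & ff) != ff)
			break;          /* this bio wants more retries */
		bytes += bio->size;
	}
	return bytes;
}
```

For a chain whose first bio has DEV and TRANSPORT set, a second bio with all three bits is still counted, while a third bio with only DEV ends the prefix.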
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
void blk_account_io_completion ( struct request * req , unsigned int bytes )
2009-01-23 12:54:44 +03:00
{
2009-04-24 10:10:11 +04:00
if ( blk_do_io_stat ( req ) ) {
2009-01-23 12:54:44 +03:00
const int rw = rq_data_dir ( req ) ;
struct hd_struct * part ;
int cpu ;
cpu = part_stat_lock ( ) ;
2011-01-05 18:57:38 +03:00
part = req - > part ;
2009-01-23 12:54:44 +03:00
part_stat_add ( cpu , part , sectors [ rw ] , bytes > > 9 ) ;
part_stat_unlock ( ) ;
}
}
2013-10-24 12:20:05 +04:00
void blk_account_io_done ( struct request * req )
2009-01-23 12:54:44 +03:00
{
/*
2010-09-03 13:56:16 +04:00
* Account IO completion . flush_rq isn ' t accounted as a
* normal IO on queueing nor completion . Accounting the
* containing request is enough .
2009-01-23 12:54:44 +03:00
*/
2011-01-25 14:43:49 +03:00
if ( blk_do_io_stat ( req ) & & ! ( req - > cmd_flags & REQ_FLUSH_SEQ ) ) {
2009-01-23 12:54:44 +03:00
unsigned long duration = jiffies - req - > start_time ;
const int rw = rq_data_dir ( req ) ;
struct hd_struct * part ;
int cpu ;
cpu = part_stat_lock ( ) ;
2011-01-05 18:57:38 +03:00
part = req - > part ;
2009-01-23 12:54:44 +03:00
part_stat_inc ( cpu , part , ios [ rw ] ) ;
part_stat_add ( cpu , part , ticks [ rw ] , duration ) ;
part_round_stats ( cpu , part ) ;
block: Separate read and write statistics of in_flight requests v2
Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-06 22:16:55 +04:00
part_dec_in_flight ( part , rw ) ;
2009-01-23 12:54:44 +03:00
2011-01-07 10:43:37 +03:00
hd_struct_put ( part ) ;
2009-01-23 12:54:44 +03:00
part_stat_unlock ( ) ;
}
}
2013-03-23 07:42:27 +04:00
# ifdef CONFIG_PM_RUNTIME
/*
* Don ' t process normal requests when queue is suspended
* or in the process of suspending / resuming
*/
static struct request * blk_pm_peek_request ( struct request_queue * q ,
struct request * rq )
{
if ( q - > dev & & ( q - > rpm_status = = RPM_SUSPENDED | |
( q - > rpm_status ! = RPM_ACTIVE & & ! ( rq - > cmd_flags & REQ_PM ) ) ) )
return NULL ;
else
return rq ;
}
# else
static inline struct request * blk_pm_peek_request ( struct request_queue * q ,
struct request * rq )
{
return rq ;
}
# endif
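The condition in blk_pm_peek_request is compact enough that its cases are easy to misread. The sketch below restates it as a userspace predicate that says whether a request may be handed out. The enum and function names are hypothetical, and the enumerator order is not meant to match the kernel's RPM_* values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the runtime PM states used above. */
enum toy_rpm_status {
	TOY_RPM_ACTIVE,
	TOY_RPM_RESUMING,
	TOY_RPM_SUSPENDED,
	TOY_RPM_SUSPENDING
};

/*
 * True if the request may be handed to the driver:
 *  - queues without a device are never held back;
 *  - a suspended queue hands out nothing;
 *  - a queue in transition only hands out PM requests.
 */
static bool toy_pm_may_dispatch(bool has_dev, enum toy_rpm_status status,
				bool is_pm_request)
{
	if (has_dev && (status == TOY_RPM_SUSPENDED ||
			(status != TOY_RPM_ACTIVE && !is_pm_request)))
		return false;
	return true;
}
```

The asymmetry is that PM requests pass while the queue is suspending or resuming, but not once it is fully suspended.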
2013-10-24 12:20:05 +04:00
void blk_account_io_start ( struct request * rq , bool new_io )
{
struct hd_struct * part ;
int rw = rq_data_dir ( rq ) ;
int cpu ;
if ( ! blk_do_io_stat ( rq ) )
return ;
cpu = part_stat_lock ( ) ;
if ( ! new_io ) {
part = rq - > part ;
part_stat_inc ( cpu , part , merges [ rw ] ) ;
} else {
part = disk_map_sector_rcu ( rq - > rq_disk , blk_rq_pos ( rq ) ) ;
if ( ! hd_struct_try_get ( part ) ) {
/*
* The partition is already being removed ,
* the request will be accounted on the disk only
*
* We take a reference on disk - > part0 although that
* partition will never be deleted , so we can treat
* it as any other partition .
*/
part = & rq - > rq_disk - > part0 ;
hd_struct_get ( part ) ;
}
part_round_stats ( cpu , part ) ;
part_inc_in_flight ( part , rw ) ;
rq - > part = part ;
}
part_stat_unlock ( ) ;
}
2007-12-12 01:52:28 +03:00
/**
2009-05-08 06:54:16 +04:00
* blk_peek_request - peek at the top of a request queue
* @ q : request queue to peek at
*
* Description :
* Return the request at the top of @ q . The returned request
* should be started using blk_start_request ( ) before LLD starts
* processing it .
*
* Return :
* Pointer to the request at the top of @ q if available . Null
* otherwise .
*
* Context :
* queue_lock must be held .
*/
struct request * blk_peek_request ( struct request_queue * q )
2009-04-23 06:05:18 +04:00
{
struct request * rq ;
int ret ;
while ( ( rq = __elv_next_request ( q ) ) ! = NULL ) {
2013-03-23 07:42:27 +04:00
rq = blk_pm_peek_request ( q , rq ) ;
if ( ! rq )
break ;
2009-04-23 06:05:18 +04:00
if ( ! ( rq - > cmd_flags & REQ_STARTED ) ) {
/*
* This is the first time the device driver
* sees this request ( possibly after
* requeueing ) . Notify IO scheduler .
*/
2010-08-07 20:17:56 +04:00
if ( rq - > cmd_flags & REQ_SORTED )
2009-04-23 06:05:18 +04:00
elv_activate_rq ( q , rq ) ;
/*
* just mark as started even if we don ' t start
* it , a request that has been delayed should
* not be passed by new incoming requests
*/
rq - > cmd_flags | = REQ_STARTED ;
trace_block_rq_issue ( q , rq ) ;
}
if ( ! q - > boundary_rq | | q - > boundary_rq = = rq ) {
q - > end_sector = rq_end_sector ( rq ) ;
q - > boundary_rq = NULL ;
}
if ( rq - > cmd_flags & REQ_DONTPREP )
break ;
2009-05-07 17:24:41 +04:00
if ( q - > dma_drain_size & & blk_rq_bytes ( rq ) ) {
2009-04-23 06:05:18 +04:00
/*
* make sure space for the drain appears ; we
* know we can do this because max_hw_segments
* has been adjusted to be one fewer than the
* device can handle
*/
rq - > nr_phys_segments + + ;
}
if ( ! q - > prep_rq_fn )
break ;
ret = q - > prep_rq_fn ( q , rq ) ;
if ( ret = = BLKPREP_OK ) {
break ;
} else if ( ret = = BLKPREP_DEFER ) {
/*
* the request may have been ( partially ) prepped .
* we need to keep this request in the front to
* avoid resource deadlock . REQ_STARTED will
* prevent other fs requests from passing this one .
*/
2009-05-07 17:24:41 +04:00
if ( q - > dma_drain_size & & blk_rq_bytes ( rq ) & &
2009-04-23 06:05:18 +04:00
! ( rq - > cmd_flags & REQ_DONTPREP ) ) {
/*
* remove the space for the drain we added
* so that we don ' t add it again
*/
- - rq - > nr_phys_segments ;
}
rq = NULL ;
break ;
} else if ( ret = = BLKPREP_KILL ) {
rq - > cmd_flags | = REQ_QUIET ;
2009-05-30 08:43:49 +04:00
/*
* Mark this request as started so we don ' t trigger
* any debug logic in the end I / O path .
*/
blk_start_request ( rq ) ;
2009-04-23 06:05:19 +04:00
__blk_end_request_all ( rq , - EIO ) ;
2009-04-23 06:05:18 +04:00
} else {
printk ( KERN_ERR " %s: bad return=%d \n " , __func__ , ret ) ;
break ;
}
}
return rq ;
}
2009-05-08 06:54:16 +04:00
EXPORT_SYMBOL ( blk_peek_request ) ;
2009-04-23 06:05:18 +04:00
2009-05-08 06:54:16 +04:00
void blk_dequeue_request ( struct request * rq )
2009-04-23 06:05:18 +04:00
{
2009-05-08 06:54:16 +04:00
struct request_queue * q = rq - > q ;
2009-04-23 06:05:18 +04:00
BUG_ON ( list_empty ( & rq - > queuelist ) ) ;
BUG_ON ( ELV_ON_HASH ( rq ) ) ;
list_del_init ( & rq - > queuelist ) ;
/*
* the time frame between a request being removed from the lists
* and the point where it is freed is accounted as io that is in
* progress at the driver side .
*/
2010-04-02 02:01:41 +04:00
if ( blk_account_rq ( rq ) ) {
2009-05-20 10:54:31 +04:00
q - > in_flight [ rq_is_sync ( rq ) ] + + ;
2010-04-02 02:01:41 +04:00
set_io_start_time_ns ( rq ) ;
}
2009-04-23 06:05:18 +04:00
}
2009-05-08 06:54:16 +04:00
/**
* blk_start_request - start request processing on the driver
* @ req : request to dequeue
*
* Description :
* Dequeue @ req and start timeout timer on it . This hands off the
* request to the driver .
*
* Block internal functions which don ' t want to start timer should
* call blk_dequeue_request ( ) .
*
* Context :
* queue_lock must be held .
*/
void blk_start_request ( struct request * req )
{
blk_dequeue_request ( req ) ;
/*
2009-05-19 13:33:05 +04:00
* We are now handing the request to the hardware , initialize
* resid_len to full count and add the timeout handler .
2009-05-08 06:54:16 +04:00
*/
2009-05-19 13:33:05 +04:00
req - > resid_len = blk_rq_bytes ( req ) ;
2009-06-09 07:47:10 +04:00
if ( unlikely ( blk_bidi_rq ( req ) ) )
req - > next_rq - > resid_len = blk_rq_bytes ( req - > next_rq ) ;
block: fix race between request completion and timeout handling
crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
RIP: 0010:[<ffffffff8124e424>] [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
Stack:
0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
<0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
<0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
Call Trace:
[<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
[<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
[<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
[<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
[<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
[<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
[<ffffffff810908c6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090830>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
RIP [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP <ffff881057eefd60>
The RIP is this line:
BUG_ON(blk_queued_rq(rq));
After digging through the code, I think there may be a race between the
request completion and the timer handler running.
A timer is started for each request put on the device's queue (see
blk_start_request->blk_add_timer). If the request does not complete
before the timer expires, the timer handler (blk_rq_timed_out_timer)
will mark the request complete atomically:
static inline int blk_mark_rq_complete(struct request *rq)
{
return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}
and then call blk_rq_timed_out. The latter function will call
scsi_times_out, which will return one of BLK_EH_HANDLED,
BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
returned, blk_clear_rq_complete is called, and blk_add_timer is again
called to simply wait longer for the request to complete.
Now, if the request happens to complete while this is going on, what
happens? Given that we know the completion handler will bail if it
finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
handler running after that bit is cleared. So, from the above
paragraph, after the call to blk_clear_rq_complete. If the completion
sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
there (I haven't seen this in the cores). Next, if we get the
completion before the call to list_add_tail, then the timer will
eventually fire for an old req, which may either be freed or reallocated
(there is evidence that this might be the case). Finally, if the
completion comes in *after* the addition to the timeout list, I think
it's harmless. The request will be removed from the timeout list,
REQ_ATOM_COMPLETE will be set, and all will be well.
This will only actually explain the coredumps *IF* the request
structure was freed, reallocated *and* queued before the error handler
thread had a chance to process it. That is possible, but it may make
sense to keep digging for another race. I think that if this is what
was happening, we would see other instances of this problem showing up
as null pointer or garbage pointer dereferences, for example when the
request structure was not re-used. It looks like we actually do run
into that situation in other reports.
This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
&req->atomic_flags)); from blk_add_timer to the only caller that could
trip over it (blk_start_request). It then inverts the calls to
blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
the race. I've boot tested this patch, but nothing more.
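The ordering argument above can be illustrated with a user-space model built on C11 atomics. This is a sketch, not the kernel code: mock_request and the mock_* helpers are hypothetical names, an atomic_bool stands in for the REQ_ATOM_COMPLETE bit, and a plain flag stands in for membership in the timeout list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct request. */
struct mock_request {
    atomic_bool complete;   /* models REQ_ATOM_COMPLETE */
    bool on_timeout_list;   /* models membership in the timeout list */
};

static void mock_request_init(struct mock_request *rq)
{
    atomic_init(&rq->complete, false);
    rq->on_timeout_list = true;
}

/* Like test_and_set_bit(): returns true if the bit was already set. */
static bool mock_mark_rq_complete(struct mock_request *rq)
{
    return atomic_exchange(&rq->complete, true);
}

/*
 * Reset-timer case with the patched ordering: re-arm the timer while the
 * timeout handler still owns the request, and only then release ownership.
 */
static void mock_reset_timer(struct mock_request *rq)
{
    assert(atomic_load(&rq->complete));     /* caller must own the request */
    rq->on_timeout_list = true;             /* stands in for blk_add_timer() */
    atomic_store(&rq->complete, false);     /* stands in for blk_clear_rq_complete() */
}

/* Completion path: bail out if the timeout handler owns the request. */
static bool mock_complete(struct mock_request *rq)
{
    if (mock_mark_rq_complete(rq))
        return false;
    rq->on_timeout_list = false;            /* stands in for blk_delete_timer() */
    return true;
}
```

With this ordering, a completion that arrives while the timeout handler holds the flag is rejected, and one that arrives after the flag is cleared always finds the request on the timeout list and can remove it.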
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-08 22:36:41 +04:00
BUG_ON ( test_bit ( REQ_ATOM_COMPLETE , & req - > atomic_flags ) ) ;
2009-05-08 06:54:16 +04:00
blk_add_timer ( req ) ;
}
EXPORT_SYMBOL ( blk_start_request ) ;
/**
* blk_fetch_request - fetch a request from a request queue
* @ q : request queue to fetch a request from
*
* Description :
* Return the request at the top of @ q . The request is started on
* return and LLD can start processing it immediately .
*
* Return :
* Pointer to the request at the top of @ q if available . Null
* otherwise .
*
* Context :
* queue_lock must be held .
*/
struct request * blk_fetch_request ( struct request_queue * q )
{
struct request * rq ;
rq = blk_peek_request ( q ) ;
if ( rq )
blk_start_request ( rq ) ;
return rq ;
}
EXPORT_SYMBOL ( blk_fetch_request ) ;
2007-12-12 01:52:28 +03:00
/**
2009-04-23 06:05:18 +04:00
* blk_update_request - Special helper function for request stacking drivers
2009-06-12 07:00:41 +04:00
* @ req : the request being processed
2008-08-19 22:13:11 +04:00
* @ error : % 0 for success , < % 0 for error
2009-06-12 07:00:41 +04:00
* @ nr_bytes : number of bytes to complete @ req
2007-12-12 01:52:28 +03:00
*
* Description :
2009-06-12 07:00:41 +04:00
* Ends I / O on a number of bytes attached to @ req , but doesn ' t complete
* the request structure even if @ req doesn ' t have leftover .
* If @ req has leftover , sets it up for the next range of segments .
2009-04-23 06:05:18 +04:00
*
* This special helper function is only for request stacking drivers
* ( e . g . request - based dm ) so that they can handle partial completion .
* Actual device drivers should use blk_end_request instead .
*
* Passing the result of blk_rq_bytes ( ) as @ nr_bytes guarantees
* % false return from this function .
2007-12-12 01:52:28 +03:00
*
* Return :
2009-04-23 06:05:18 +04:00
* % false - this request doesn ' t have any more data
* % true - this request has more data
2007-12-12 01:52:28 +03:00
* */
2009-04-23 06:05:18 +04:00
bool blk_update_request ( struct request * req , int error , unsigned int nr_bytes )
2005-04-17 02:20:36 +04:00
{
2012-09-21 03:38:30 +04:00
int total_bytes ;
2005-04-17 02:20:36 +04:00
2014-10-01 16:32:31 +04:00
trace_block_rq_complete ( req - > q , req , nr_bytes ) ;
2009-04-23 06:05:18 +04:00
if ( ! req - > bio )
return false ;
2005-04-17 02:20:36 +04:00
/*
2009-04-19 02:00:41 +04:00
* For fs requests , rq is just carrier of independent bio ' s
* and each partial completion should be handled separately .
* Reset per - request error on each partial completion .
*
* TODO : tj : This is too subtle . It would be better to let
* low level drivers do what they see fit .
2005-04-17 02:20:36 +04:00
*/
2010-08-07 20:17:56 +04:00
if ( req - > cmd_type = = REQ_TYPE_FS )
2005-04-17 02:20:36 +04:00
req - > errors = 0 ;
2010-08-07 20:17:56 +04:00
if ( error & & req - > cmd_type = = REQ_TYPE_FS & &
! ( req - > cmd_flags & REQ_QUIET ) ) {
2011-01-18 12:13:13 +03:00
char * error_type ;
switch ( error ) {
case - ENOLINK :
error_type = " recoverable transport " ;
break ;
case - EREMOTEIO :
error_type = " critical target " ;
break ;
case - EBADE :
error_type = " critical nexus " ;
break ;
2013-01-30 13:26:16 +04:00
case - ETIMEDOUT :
error_type = " timeout " ;
break ;
2013-07-01 17:16:25 +04:00
case - ENOSPC :
error_type = " critical space allocation " ;
break ;
2013-07-01 17:16:26 +04:00
case - ENODATA :
error_type = " critical medium " ;
break ;
2011-01-18 12:13:13 +03:00
case - EIO :
default :
error_type = " I/O " ;
break ;
}
block: make blk_update_request print prefix match ratelimited prefix
In blk_update_request, change the printk_ratelimited
prefix from end_request to blk_update_request so it
matches the name printed if rate limiting occurs.
Old:
[10234.933106] blk_update_request: 174 callbacks suppressed
[10234.934940] end_request: critical target error, dev sdr, sector 16
[10234.949788] end_request: critical target error, dev sdr, sector 16
New:
[16863.445173] blk_update_request: 398 callbacks suppressed
[16863.447029] blk_update_request: critical target error, dev sdr, sector
1442066176
[16863.449383] blk_update_request: critical target error, dev sdr, sector
802802888
[16863.451680] blk_update_request: critical target error, dev sdr, sector
1609535456
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-27 19:50:31 +04:00
printk_ratelimited ( KERN_ERR " %s: %s error, dev %s, sector %llu \n " ,
__func__ , error_type , req - > rq_disk ?
2012-08-31 03:26:25 +04:00
req - > rq_disk - > disk_name : " ? " ,
( unsigned long long ) blk_rq_pos ( req ) ) ;
2005-04-17 02:20:36 +04:00
}
2009-01-23 12:54:44 +03:00
blk_account_io_completion ( req , nr_bytes ) ;
2005-11-01 10:35:42 +03:00
2012-09-21 03:38:30 +04:00
total_bytes = 0 ;
while ( req - > bio ) {
struct bio * bio = req - > bio ;
2013-10-12 02:44:27 +04:00
unsigned bio_bytes = min ( bio - > bi_iter . bi_size , nr_bytes ) ;
2005-04-17 02:20:36 +04:00
2013-10-12 02:44:27 +04:00
if ( bio_bytes = = bio - > bi_iter . bi_size )
2005-04-17 02:20:36 +04:00
req - > bio = bio - > bi_next ;
2012-09-21 03:38:30 +04:00
req_bio_endio ( req , bio , bio_bytes , error ) ;
2005-04-17 02:20:36 +04:00
2012-09-21 03:38:30 +04:00
total_bytes + = bio_bytes ;
nr_bytes - = bio_bytes ;
2005-04-17 02:20:36 +04:00
2012-09-21 03:38:30 +04:00
if ( ! nr_bytes )
break ;
2005-04-17 02:20:36 +04:00
}
/*
* completely done
*/
2009-04-23 06:05:18 +04:00
if ( ! req - > bio ) {
/*
* Reset counters so that the request stacking driver
* can find how many bytes remain in the request
* later .
*/
2009-05-07 17:24:44 +04:00
req - > __data_len = 0 ;
2009-04-23 06:05:18 +04:00
return false ;
}
2005-04-17 02:20:36 +04:00
2009-05-07 17:24:44 +04:00
req - > __data_len - = total_bytes ;
2009-05-07 17:24:41 +04:00
/* update sector only for requests with clear definition of sector */
2012-09-18 20:19:25 +04:00
if ( req - > cmd_type = = REQ_TYPE_FS )
2009-05-07 17:24:44 +04:00
req - > __sector + = total_bytes > > 9 ;
2009-05-07 17:24:41 +04:00
2009-07-03 12:48:17 +04:00
/* mixed attributes always follow the first bio */
if ( req - > cmd_flags & REQ_MIXED_MERGE ) {
req - > cmd_flags & = ~ REQ_FAILFAST_MASK ;
req - > cmd_flags | = req - > bio - > bi_rw & REQ_FAILFAST_MASK ;
}
2009-05-07 17:24:41 +04:00
/*
* If total number of sectors is less than the first segment
* size , something has gone terribly wrong .
*/
if ( blk_rq_bytes ( req ) < blk_rq_cur_bytes ( req ) ) {
2011-03-30 11:51:33 +04:00
blk_dump_rq_flags ( req , " request botched " ) ;
2009-05-07 17:24:44 +04:00
req - > __data_len = blk_rq_cur_bytes ( req ) ;
2009-05-07 17:24:41 +04:00
}
/* recalculate the number of segments */
2005-04-17 02:20:36 +04:00
blk_recalc_rq_segments ( req ) ;
2009-05-07 17:24:41 +04:00
2009-04-23 06:05:18 +04:00
return true ;
2005-04-17 02:20:36 +04:00
}
2009-04-23 06:05:18 +04:00
EXPORT_SYMBOL_GPL ( blk_update_request ) ;
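The partial-completion walk in blk_update_request ( ) can be modelled in user space as follows. This is a simplified sketch: mock_bio, mock_request and mock_update_request are hypothetical names, error reporting, accounting and segment recalculation are left out, and shrinking the bio size stands in for what req_bio_endio ( ) does to the bio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct bio and struct request. */
struct mock_bio {
    unsigned int size;          /* bytes still outstanding in this bio */
    struct mock_bio *next;
};

struct mock_request {
    struct mock_bio *bio;       /* first bio that is not fully completed */
    unsigned int data_len;      /* bytes still outstanding in the request */
};

/* Returns false when the request has no more data, true when some is left. */
static bool mock_update_request(struct mock_request *req, unsigned int nr_bytes)
{
    unsigned int total_bytes = 0;

    assert(nr_bytes <= req->data_len);

    if (!req->bio)
        return false;

    while (req->bio) {
        struct mock_bio *bio = req->bio;
        unsigned int bio_bytes = bio->size < nr_bytes ? bio->size : nr_bytes;

        /* A fully consumed bio is unlinked from the request. */
        if (bio_bytes == bio->size)
            req->bio = bio->next;

        bio->size -= bio_bytes;
        total_bytes += bio_bytes;
        nr_bytes -= bio_bytes;

        if (!nr_bytes)
            break;
    }

    /* Completely done: reset the remaining length. */
    if (!req->bio) {
        req->data_len = 0;
        return false;
    }

    req->data_len -= total_bytes;
    return true;
}
```

Passing the full remaining length as nr_bytes always yields false in this model, which mirrors the guarantee stated in the kernel-doc comment for blk_rq_bytes ( ).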
2005-04-17 02:20:36 +04:00
2009-04-23 06:05:18 +04:00
static bool blk_update_bidi_request ( struct request * rq , int error ,
unsigned int nr_bytes ,
unsigned int bidi_bytes )
2009-04-23 06:05:18 +04:00
{
2009-04-23 06:05:18 +04:00
if ( blk_update_request ( rq , error , nr_bytes ) )
return true ;
2009-04-23 06:05:18 +04:00
2009-04-23 06:05:18 +04:00
/* Bidi request must be completed as a whole */
if ( unlikely ( blk_bidi_rq ( rq ) ) & &
blk_update_request ( rq - > next_rq , error , bidi_bytes ) )
return true ;
2009-04-23 06:05:18 +04:00
2010-06-09 12:42:09 +04:00
if ( blk_queue_add_random ( rq - > q ) )
add_disk_randomness ( rq - > rq_disk ) ;
2009-04-23 06:05:18 +04:00
return false ;
2005-04-17 02:20:36 +04:00
}
2010-07-01 14:49:17 +04:00
/**
* blk_unprep_request - unprepare a request
* @ req : the request
*
* This function makes a request ready for complete resubmission ( or
* completion ) . It happens only after all error handling is complete ,
* so represents the appropriate moment to deallocate any resources
* that were allocated to the request in the prep_rq_fn . The queue
* lock is held when calling this .
*/
void blk_unprep_request ( struct request * req )
{
struct request_queue * q = req - > q ;
req - > cmd_flags & = ~ REQ_DONTPREP ;
if ( q - > unprep_rq_fn )
q - > unprep_rq_fn ( q , req ) ;
}
EXPORT_SYMBOL_GPL ( blk_unprep_request ) ;
2005-04-17 02:20:36 +04:00
/*
* queue lock must be held
*/
2014-04-16 11:44:59 +04:00
void blk_finish_request ( struct request * req , int error )
2005-04-17 02:20:36 +04:00
{
2007-12-12 01:53:24 +03:00
if ( blk_rq_tagged ( req ) )
blk_queue_end_tag ( req - > q , req ) ;
2009-05-27 16:17:08 +04:00
BUG_ON ( blk_queued_rq ( req ) ) ;
2005-04-17 02:20:36 +04:00
2010-08-07 20:17:56 +04:00
if ( unlikely ( laptop_mode ) & & req - > cmd_type = = REQ_TYPE_FS )
2010-04-06 16:25:14 +04:00
laptop_io_completion ( & req - > q - > backing_dev_info ) ;
2005-04-17 02:20:36 +04:00
2008-10-30 12:16:20 +03:00
blk_delete_timer ( req ) ;
2010-07-01 14:49:17 +04:00
if ( req - > cmd_flags & REQ_DONTPREP )
blk_unprep_request ( req ) ;
2009-01-23 12:54:44 +03:00
blk_account_io_done ( req ) ;
2007-12-12 01:53:24 +03:00
2005-04-17 02:20:36 +04:00
if ( req - > end_io )
2006-01-06 11:49:03 +03:00
req - > end_io ( req , error ) ;
2007-12-12 01:53:24 +03:00
else {
if ( blk_bidi_rq ( req ) )
__blk_put_request ( req - > next_rq - > q , req - > next_rq ) ;
2005-04-17 02:20:36 +04:00
__blk_put_request ( req - > q , req ) ;
2007-12-12 01:53:24 +03:00
}
2005-04-17 02:20:36 +04:00
}
2014-04-16 11:44:59 +04:00
EXPORT_SYMBOL ( blk_finish_request ) ;
2005-04-17 02:20:36 +04:00
2007-12-12 01:41:17 +03:00
/**
2009-04-23 06:05:18 +04:00
* blk_end_bidi_request - Complete a bidi request
* @ rq : the request to complete
* @ error : % 0 for success , < % 0 for error
* @ nr_bytes : number of bytes to complete @ rq
* @ bidi_bytes : number of bytes to complete @ rq - > next_rq
2007-09-21 12:41:07 +04:00
*
* Description :
2007-12-12 01:51:46 +03:00
* Ends I / O on a number of bytes attached to @ rq and @ rq - > next_rq .
2009-04-23 06:05:18 +04:00
* Drivers that support bidi can safely call this function for any
* type of request , bidi or uni . In the latter case @ bidi_bytes is
* just ignored .
2007-12-12 01:40:30 +03:00
*
* Return :
2009-04-23 06:05:18 +04:00
* % false - we are done with this request
* % true - still buffers pending for this request
2007-09-21 12:41:07 +04:00
* */
2009-05-11 12:56:09 +04:00
static bool blk_end_bidi_request ( struct request * rq , int error ,
block: add request update interface
This patch adds blk_update_request(), which updates struct request
with completing its data part, but doesn't complete the struct
request itself.
Though it looks like end_that_request_first() of older kernels,
blk_update_request() should be used only by request stacking drivers.
Request-based dm will use it in bio->bi_end_io callback to update
the original request when a data part of a cloned request completes.
The following is additional background information on why request-based
dm needs this interface.
- Request stacking drivers can't use blk_end_request() directly from
the lower driver's completion context (bio->bi_end_io or rq->end_io),
because some device drivers (e.g. ide) may try to complete
their request with queue lock held, and it may cause deadlock.
See below for detailed description of possible deadlock:
<http://marc.info/?l=linux-kernel&m=120311479108569&w=2>
- To solve that, request-based dm offloads the completion of
cloned struct request to softirq context (i.e. using
blk_complete_request() from rq->end_io).
- Though it is possible to use the same solution from bio->bi_end_io,
it will delay the notification of bio completion to the original
submitter. Also, it will cause inefficient partial completion,
because the lower driver can't perform the cloned request anymore
and request-based dm needs to requeue and redispatch it to
the lower driver again later. That's not good.
- So request-based dm needs blk_update_request() to perform the bio
completion in the lower driver's completion context, which is more
efficient.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-09-18 18:45:09 +04:00
unsigned int nr_bytes , unsigned int bidi_bytes )
{
2007-12-12 01:40:30 +03:00
struct request_queue * q = rq - > q ;
2009-04-23 06:05:18 +04:00
unsigned long flags ;
2008-09-18 18:45:09 +04:00
2009-04-23 06:05:18 +04:00
if ( blk_update_bidi_request ( rq , error , nr_bytes , bidi_bytes ) )
return true ;
2008-09-18 18:45:09 +04:00
2007-12-12 01:40:30 +03:00
spin_lock_irqsave ( q - > queue_lock , flags ) ;
2009-04-23 06:05:18 +04:00
blk_finish_request ( rq , error ) ;
2007-12-12 01:40:30 +03:00
spin_unlock_irqrestore ( q - > queue_lock , flags ) ;
2009-04-23 06:05:18 +04:00
return false ;
2008-09-18 18:45:09 +04:00
}
2007-12-12 01:40:30 +03:00
/**
2009-04-23 06:05:18 +04:00
* __blk_end_bidi_request - Complete a bidi request with queue lock held
* @ rq : the request to complete
2008-08-19 22:13:11 +04:00
* @ error : % 0 for success , < % 0 for error
2007-12-12 01:51:46 +03:00
* @ nr_bytes : number of bytes to complete @ rq
* @ bidi_bytes : number of bytes to complete @ rq - > next_rq
2007-12-12 01:40:30 +03:00
*
* Description :
2009-04-23 06:05:18 +04:00
* Identical to blk_end_bidi_request ( ) except that queue lock is
* assumed to be locked on entry and remains so on return .
2007-12-12 01:40:30 +03:00
*
* Return :
2009-04-23 06:05:18 +04:00
* % false - we are done with this request
* % true - still buffers pending for this request
2007-12-12 01:40:30 +03:00
* */
block: fix flush machinery for stacking drivers with differring flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
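The bypass condition quoted from blk_do_flush can be reproduced in a small user-space model. This is a sketch under stated assumptions: the MOCK_REQ_* bit values, mock_flush_result and mock_do_flush are hypothetical, and only the flag-stripping decision from the excerpt above is modelled, not the flush sequencing itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real REQ_FLUSH / REQ_FUA values differ. */
#define MOCK_REQ_FLUSH (1u << 0)
#define MOCK_REQ_FUA   (1u << 1)

struct mock_flush_result {
    unsigned int cmd_flags;     /* request flags after stripping */
    bool bypassed;              /* true if the flush machinery was skipped */
};

/*
 * queue_flags: what the queue advertises (models q->flush_flags).
 * cmd_flags:   what the request asks for (models rq->cmd_flags).
 * sectors:     data size of the request (models blk_rq_sectors(rq)).
 */
static struct mock_flush_result mock_do_flush(unsigned int queue_flags,
                                              unsigned int cmd_flags,
                                              unsigned int sectors)
{
    bool has_flush = queue_flags & MOCK_REQ_FLUSH;
    bool has_fua = queue_flags & MOCK_REQ_FUA;
    bool do_preflush = has_flush && (cmd_flags & MOCK_REQ_FLUSH);
    bool do_postflush = has_flush && !has_fua && (cmd_flags & MOCK_REQ_FUA);
    struct mock_flush_result res = { cmd_flags, false };

    if (sectors && !do_preflush && !do_postflush) {
        /* Strip what the queue cannot honour and issue as a plain request. */
        res.cmd_flags &= ~MOCK_REQ_FLUSH;
        if (!has_fua)
            res.cmd_flags &= ~MOCK_REQ_FUA;
        res.bypassed = true;
    }
    return res;
}

/* Small self-check covering the case described in the text. */
static void mock_do_flush_selfcheck(void)
{
    struct mock_flush_result r =
        mock_do_flush(0, MOCK_REQ_FLUSH | MOCK_REQ_FUA, 8);

    assert(r.bypassed);
    assert(r.cmd_flags == 0);
}
```

The self-check covers the case called out in the text: a queue that advertises nothing receives a request with both bits set, and the request leaves with both bits cleared.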
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-15 23:37:25 +04:00
bool __blk_end_bidi_request ( struct request * rq , int error ,
2009-05-11 12:56:09 +04:00
unsigned int nr_bytes , unsigned int bidi_bytes )
2007-12-12 01:40:30 +03:00
{
2009-04-23 06:05:18 +04:00
if ( blk_update_bidi_request ( rq , error , nr_bytes , bidi_bytes ) )
return true ;
2007-12-12 01:40:30 +03:00
2009-04-23 06:05:18 +04:00
blk_finish_request ( rq , error ) ;
2007-12-12 01:40:30 +03:00
2009-04-23 06:05:18 +04:00
return false ;
2007-12-12 01:40:30 +03:00
}
2007-12-12 01:51:02 +03:00
/**
* blk_end_request - Helper function for drivers to complete the request .
* @ rq : the request being processed
2008-08-19 22:13:11 +04:00
* @ error : % 0 for success , < % 0 for error
2007-12-12 01:51:02 +03:00
* @ nr_bytes : number of bytes to complete
*
* Description :
* Ends I / O on a number of bytes attached to @ rq .
* If @ rq has leftover , sets it up for the next range of segments .
*
* Return :
2009-05-11 12:56:09 +04:00
* % false - we are done with this request
* % true - still buffers pending for this request
2007-12-12 01:51:02 +03:00
* */
2009-05-11 12:56:09 +04:00
bool blk_end_request ( struct request * rq , int error , unsigned int nr_bytes )
2007-12-12 01:51:02 +03:00
{
2009-05-11 12:56:09 +04:00
return blk_end_bidi_request ( rq , error , nr_bytes , 0 ) ;
2007-12-12 01:51:02 +03:00
}
2009-07-29 00:11:24 +04:00
EXPORT_SYMBOL ( blk_end_request ) ;
2007-12-12 01:40:30 +03:00
/**
2009-05-11 12:56:09 +04:00
* blk_end_request_all - Helper function for drivers to finish the request .
* @ rq : the request to finish
2009-06-12 07:00:41 +04:00
* @ error : % 0 for success , < % 0 for error
2007-12-12 01:40:30 +03:00
*
* Description :
2009-05-11 12:56:09 +04:00
* Completely finish @ rq .
*/
void blk_end_request_all ( struct request * rq , int error )
2007-12-12 01:40:30 +03:00
{
2009-05-11 12:56:09 +04:00
bool pending ;
unsigned int bidi_bytes = 0 ;
2007-12-12 01:40:30 +03:00
2009-05-11 12:56:09 +04:00
if ( unlikely ( blk_bidi_rq ( rq ) ) )
bidi_bytes = blk_rq_bytes ( rq - > next_rq ) ;
2007-12-12 01:40:30 +03:00
2009-05-11 12:56:09 +04:00
pending = blk_end_bidi_request ( rq , error , blk_rq_bytes ( rq ) , bidi_bytes ) ;
BUG_ON ( pending ) ;
}
2009-07-29 00:11:24 +04:00
EXPORT_SYMBOL ( blk_end_request_all ) ;
2007-12-12 01:40:30 +03:00
2009-05-11 12:56:09 +04:00
/**
* blk_end_request_cur - Helper function to finish the current request chunk .
* @ rq : the request to finish the current chunk for
2009-06-12 07:00:41 +04:00
* @ error : % 0 for success , < % 0 for error
2009-05-11 12:56:09 +04:00
*
* Description :
* Complete the current consecutively mapped chunk from @ rq .
*
* Return :
* % false - we are done with this request
* % true - still buffers pending for this request
*/
bool blk_end_request_cur ( struct request * rq , int error )
{
return blk_end_request ( rq , error , blk_rq_cur_bytes ( rq ) ) ;
2007-12-12 01:40:30 +03:00
}
2009-07-29 00:11:24 +04:00
EXPORT_SYMBOL ( blk_end_request_cur ) ;
2007-12-12 01:40:30 +03:00
2009-07-03 12:48:17 +04:00
/**
* blk_end_request_err - Finish a request till the next failure boundary .
* @ rq : the request to finish till the next failure boundary for
* @ error : must be negative errno
*
* Description :
* Complete @ rq till the next failure boundary .
*
* Return :
* % false - we are done with this request
* % true - still buffers pending for this request
*/
bool blk_end_request_err ( struct request * rq , int error )
{
WARN_ON ( error > = 0 ) ;
return blk_end_request ( rq , error , blk_rq_err_bytes ( rq ) ) ;
}
EXPORT_SYMBOL_GPL ( blk_end_request_err ) ;
2007-12-12 01:51:46 +03:00
/**
2009-05-11 12:56:09 +04:00
* __blk_end_request - Helper function for drivers to complete the request .
* @ rq : the request being processed
* @ error : % 0 for success , < % 0 for error
* @ nr_bytes : number of bytes to complete
2007-12-12 01:51:46 +03:00
*
* Description :
2009-05-11 12:56:09 +04:00
* Unlike blk_end_request ( ) , must be called with the queue lock held .
2007-12-12 01:51:46 +03:00
*
* Return :
2009-05-11 12:56:09 +04:00
* % false - we are done with this request
* % true - still buffers pending for this request
2007-12-12 01:51:46 +03:00
*/
2009-05-11 12:56:09 +04:00
bool __blk_end_request ( struct request * rq , int error , unsigned int nr_bytes )
2007-12-12 01:51:46 +03:00
{
2009-05-11 12:56:09 +04:00
return __blk_end_bidi_request ( rq , error , nr_bytes , 0 ) ;
2007-12-12 01:51:46 +03:00
}
2009-07-29 00:11:24 +04:00
EXPORT_SYMBOL ( __blk_end_request ) ;
2007-12-12 01:51:46 +03:00
block: add request update interface
This patch adds blk_update_request(), which updates struct request
with completing its data part, but doesn't complete the struct
request itself.
Though it looks like end_that_request_first() of older kernels,
blk_update_request() should be used only by request stacking drivers.
Request-based dm will use it in bio->bi_end_io callback to update
the original request when a data part of a cloned request completes.
The following is additional background information on why request-based
dm needs this interface.
- Request stacking drivers can't use blk_end_request() directly from
the lower driver's completion context (bio->bi_end_io or rq->end_io),
because some device drivers (e.g. ide) may try to complete
their request with queue lock held, and it may cause deadlock.
See below for detailed description of possible deadlock:
<http://marc.info/?l=linux-kernel&m=120311479108569&w=2>
- To solve that, request-based dm offloads the completion of
cloned struct request to softirq context (i.e. using
blk_complete_request() from rq->end_io).
- Though it is possible to use the same solution from bio->bi_end_io,
it will delay the notification of bio completion to the original
submitter. Also, it will cause inefficient partial completion,
because the lower driver can't perform the cloned request anymore
and request-based dm needs to requeue and redispatch it to
the lower driver again later. That's not good.
- So request-based dm needs blk_update_request() to perform the bio
completion in the lower driver's completion context, which is more
efficient.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-09-18 18:45:09 +04:00
/**
2009-05-11 12:56:09 +04:00
* __blk_end_request_all - Helper function for drivers to finish the request .
* @ rq : the request to finish
2009-06-12 07:00:41 +04:00
* @ error : % 0 for success , < % 0 for error
2008-09-18 18:45:09 +04:00
*
* Description :
2009-05-11 12:56:09 +04:00
* Completely finish @ rq . Must be called with queue lock held .
2008-09-18 18:45:09 +04:00
*/
2009-05-11 12:56:09 +04:00
void __blk_end_request_all ( struct request * rq , int error )
2008-09-18 18:45:09 +04:00
{
2009-05-11 12:56:09 +04:00
bool pending ;
unsigned int bidi_bytes = 0 ;
if ( unlikely ( blk_bidi_rq ( rq ) ) )
bidi_bytes = blk_rq_bytes ( rq - > next_rq ) ;
pending = __blk_end_bidi_request ( rq , error , blk_rq_bytes ( rq ) , bidi_bytes ) ;
BUG_ON ( pending ) ;
2008-09-18 18:45:09 +04:00
}
2009-07-29 00:11:24 +04:00
EXPORT_SYMBOL ( __blk_end_request_all ) ;
2008-09-18 18:45:09 +04:00
2007-12-12 01:51:02 +03:00
/**
2009-05-11 12:56:09 +04:00
* __blk_end_request_cur - Helper function to finish the current request chunk .
* @ rq : the request to finish the current chunk for
2009-06-12 07:00:41 +04:00
* @ error : % 0 for success , < % 0 for error
2007-12-12 01:51:02 +03:00
*
* Description :
2009-05-11 12:56:09 +04:00
* Complete the current consecutively mapped chunk from @ rq . Must
* be called with queue lock held .
2007-12-12 01:51:02 +03:00
*
* Return :
2009-05-11 12:56:09 +04:00
* % false - we are done with this request
* % true - still buffers pending for this request
*/
bool __blk_end_request_cur ( struct request * rq , int error )
2007-12-12 01:51:02 +03:00
{
2009-05-11 12:56:09 +04:00
return __blk_end_request ( rq , error , blk_rq_cur_bytes ( rq ) ) ;
2007-12-12 01:51:02 +03:00
}
2009-07-29 00:11:24 +04:00
EXPORT_SYMBOL ( __blk_end_request_cur ) ;
2007-12-12 01:51:02 +03:00
2009-07-03 12:48:17 +04:00
/**
* __blk_end_request_err - Finish a request till the next failure boundary .
* @ rq : the request to finish till the next failure boundary for
* @ error : must be negative errno
*
* Description :
* Complete @ rq till the next failure boundary . Must be called
* with queue lock held .
*
* Return :
* % false - we are done with this request
* % true - still buffers pending for this request
*/
bool __blk_end_request_err ( struct request * rq , int error )
{
WARN_ON ( error > = 0 ) ;
return __blk_end_request ( rq , error , blk_rq_err_bytes ( rq ) ) ;
}
EXPORT_SYMBOL_GPL ( __blk_end_request_err ) ;
2008-01-29 16:53:40 +03:00
void blk_rq_bio_prep ( struct request_queue * q , struct request * rq ,
struct bio * bio )
2005-04-17 02:20:36 +04:00
{
2009-07-03 12:48:16 +04:00
/* Bit 0 (R/W) is identical in rq->cmd_flags and bio->bi_rw */
2010-08-07 20:20:39 +04:00
rq - > cmd_flags | = bio - > bi_rw & REQ_WRITE ;
2005-04-17 02:20:36 +04:00
2014-04-10 19:46:28 +04:00
if ( bio_has_data ( bio ) )
2008-08-05 21:01:53 +04:00
rq - > nr_phys_segments = bio_phys_segments ( q , bio ) ;
2014-04-10 19:46:28 +04:00
2013-10-12 02:44:27 +04:00
rq - > __data_len = bio - > bi_iter . bi_size ;
2005-04-17 02:20:36 +04:00
rq - > bio = rq - > biotail = bio ;
2007-08-16 15:31:28 +04:00
if ( bio - > bi_bdev )
rq - > rq_disk = bio - > bi_bdev - > bd_disk ;
}
2005-04-17 02:20:36 +04:00
2009-11-26 11:16:19 +03:00
# if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
/**
* rq_flush_dcache_pages - Helper function to flush all pages in a request
* @ rq : the request to be flushed
*
* Description :
* Flush all pages in @ rq .
*/
void rq_flush_dcache_pages ( struct request * rq )
{
struct req_iterator iter ;
2013-11-24 05:19:00 +04:00
struct bio_vec bvec ;
2009-11-26 11:16:19 +03:00
rq_for_each_segment ( bvec , rq , iter )
2013-11-24 05:19:00 +04:00
flush_dcache_page ( bvec . bv_page ) ;
2009-11-26 11:16:19 +03:00
}
EXPORT_SYMBOL_GPL ( rq_flush_dcache_pages ) ;
# endif
2008-10-01 18:12:15 +04:00
/**
* blk_lld_busy - Check if underlying low - level drivers of a device are busy
* @ q : the queue of the device being checked
*
* Description :
* Check if underlying low - level drivers of a device are busy .
* If the drivers want to export their busy state , they must first set
* their own exporting function using blk_queue_lld_busy ( ) .
*
* Basically , this function is used only by request stacking drivers
* to stop dispatching requests to underlying devices when underlying
* devices are busy . This behavior allows more I / O merging on the queue
* of the request stacking driver and prevents I / O throughput regression
* on burst I / O load .
*
* Return :
* 0 - Not busy ( The request stacking driver should dispatch request )
* 1 - Busy ( The request stacking driver should stop dispatching request )
*/
int blk_lld_busy ( struct request_queue * q )
{
if ( q - > lld_busy_fn )
return q - > lld_busy_fn ( q ) ;
return 0 ;
}
EXPORT_SYMBOL_GPL ( blk_lld_busy ) ;
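The optional-callback pattern used by blk_lld_busy() can be sketched in userspace as follows. This is an illustration under stated assumptions: toy_queue, toy_depth_busy and toy_lld_busy are hypothetical names, and the busy criterion (in-flight count reaching a fixed depth) is invented for the example and is not taken from any real driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request_queue. */
struct toy_queue {
	int (*lld_busy_fn)(struct toy_queue *q); /* optional; NULL means no busy hook */
	int in_flight;                           /* requests currently dispatched */
	int depth;                               /* assumed device queue depth */
};

/* Example busy hook (an assumption): busy once in_flight reaches depth. */
static int toy_depth_busy(struct toy_queue *q)
{
	assert(q->depth > 0);
	return q->in_flight >= q->depth;
}

/*
 * Report 1 if the low-level driver says it is busy, 0 otherwise.
 * A queue without a hook is always treated as not busy.
 */
static int toy_lld_busy(struct toy_queue *q)
{
	if (q->lld_busy_fn)
		return q->lld_busy_fn(q);
	return 0;
}
```

A stacking driver in this model would poll toy_lld_busy() before dispatching and hold requests back while it returns 1, which gives queued requests more time to merge.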
block: add request clone interface (v2)
This patch adds the following 2 interfaces for request-stacking drivers:
- blk_rq_prep_clone(struct request *clone, struct request *orig,
struct bio_set *bs, gfp_t gfp_mask,
int (*bio_ctr)(struct bio *, struct bio*, void *),
void *data)
* Clones bios in the original request to the clone request
(bio_ctr is called for each cloned bio.)
* Copies attributes of the original request to the clone request.
The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
copied.
- blk_rq_unprep_clone(struct request *clone)
* Frees cloned bios from the clone request.
Request stacking drivers (e.g. request-based dm) need to make a clone
request for a submitted request and dispatch it to other devices.
To allocate request for the clone, request stacking drivers may not
be able to use blk_get_request() because the allocation may be done
in an irq-disabled context.
So blk_rq_prep_clone() takes a request allocated by the caller
as an argument.
For each clone bio in the clone request, request stacking drivers
should be able to set up their own completion handler.
So blk_rq_prep_clone() takes a callback function which is called
for each clone bio, and a pointer for private data which is passed
to the callback.
NOTE:
blk_rq_prep_clone() doesn't copy any actual data of the original
request. Pages are shared between original bios and cloned bios.
So caller must not complete the original request before the clone
request.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-11 15:10:16 +04:00
/**
* blk_rq_unprep_clone - Helper function to free all bios in a cloned request
* @ rq : the clone request to be cleaned up
*
* Description :
* Free all bios in @ rq for a cloned request .
*/
void blk_rq_unprep_clone ( struct request * rq )
{
struct bio * bio ;
while ( ( bio = rq - > bio ) ! = NULL ) {
rq - > bio = bio - > bi_next ;
bio_put ( bio ) ;
}
}
EXPORT_SYMBOL_GPL ( blk_rq_unprep_clone ) ;
/*
* Copy attributes of the original request to the clone request .
2014-04-10 19:46:28 +04:00
* The actual data parts ( e . g . - > cmd , - > sense ) are not copied .
2009-06-11 15:10:16 +04:00
*/
static void __blk_rq_prep_clone ( struct request * dst , struct request * src )
{
dst - > cpu = src - > cpu ;
2010-09-03 13:56:18 +04:00
dst - > cmd_flags = ( src - > cmd_flags & REQ_CLONE_MASK ) | REQ_NOMERGE ;
2009-06-11 15:10:16 +04:00
dst - > cmd_type = src - > cmd_type ;
dst - > __sector = blk_rq_pos ( src ) ;
dst - > __data_len = blk_rq_bytes ( src ) ;
dst - > nr_phys_segments = src - > nr_phys_segments ;
dst - > ioprio = src - > ioprio ;
dst - > extra_len = src - > extra_len ;
}
/**
* blk_rq_prep_clone - Helper function to set up a clone request
* @ rq : the request to be set up
* @ rq_src : original request to be cloned
* @ bs : bio_set that bios for clone are allocated from
* @ gfp_mask : memory allocation mask for bio
* @ bio_ctr : setup function to be called for each clone bio .
* Returns % 0 for success , non % 0 for failure .
* @ data : private data to be passed to @ bio_ctr
*
* Description :
* Clones bios in @ rq_src to @ rq , and copies attributes of @ rq_src to @ rq .
2014-04-10 19:46:28 +04:00
* The actual data parts of @ rq_src ( e . g . - > cmd , - > sense )
2009-06-11 15:10:16 +04:00
* are not copied , and copying such parts is the caller ' s responsibility .
* Also , pages which the original bios are pointing to are not copied
* and the cloned bios just point to the same pages .
* So cloned bios must be completed before original bios , which means
* the caller must complete @ rq before @ rq_src .
*/
int blk_rq_prep_clone ( struct request * rq , struct request * rq_src ,
struct bio_set * bs , gfp_t gfp_mask ,
int ( * bio_ctr ) ( struct bio * , struct bio * , void * ) ,
void * data )
{
struct bio * bio , * bio_src ;
if ( ! bs )
bs = fs_bio_set ;
blk_rq_init ( NULL , rq ) ;
__rq_for_each_bio ( bio_src , rq_src ) {
2014-10-04 01:27:11 +04:00
bio = bio_clone_fast ( bio_src , gfp_mask , bs ) ;
2009-06-11 15:10:16 +04:00
if ( ! bio )
goto free_and_out ;
if ( bio_ctr & & bio_ctr ( bio , bio_src , data ) )
goto free_and_out ;
if ( rq - > bio ) {
rq - > biotail - > bi_next = bio ;
rq - > biotail = bio ;
} else
rq - > bio = rq - > biotail = bio ;
}
__blk_rq_prep_clone ( rq , rq_src ) ;
return 0 ;
free_and_out :
if ( bio )
2012-09-07 02:35:00 +04:00
bio_put ( bio ) ;
2009-06-11 15:10:16 +04:00
blk_rq_unprep_clone ( rq ) ;
return - ENOMEM ;
}
EXPORT_SYMBOL_GPL ( blk_rq_prep_clone ) ;
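The clone-with-rollback flow of blk_rq_prep_clone() and blk_rq_unprep_clone() can be modelled in userspace roughly as below. Everything named toy_* is hypothetical: the model uses malloc()/free() in place of bio_clone_fast()/bio_put(), copies a single int payload instead of sharing pages, and omits the attribute copy done by __blk_rq_prep_clone(). As in the code above, a failing constructor callback is also reported as -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct bio and struct request. */
struct toy_bio {
	struct toy_bio *next;
	int payload;
};

struct toy_clone_rq {
	struct toy_bio *bio;     /* head of the chain */
	struct toy_bio *biotail; /* tail of the chain */
};

/* Free every node in the chain, head first. */
static void toy_unprep_clone(struct toy_clone_rq *rq)
{
	struct toy_bio *bio;

	while ((bio = rq->bio) != NULL) {
		rq->bio = bio->next;
		free(bio);
	}
	rq->biotail = NULL;
}

/* Example constructor (an assumption): reject negative payloads. */
static int toy_reject_negative(struct toy_bio *clone,
			       const struct toy_bio *orig, void *data)
{
	(void)clone;
	(void)data;
	return orig->payload < 0 ? -1 : 0;
}

/*
 * Clone each node of src into dst. ctr, if non-NULL, runs once per
 * clone and returns 0 on success. On any failure the partially built
 * chain is torn down and -ENOMEM is returned.
 */
static int toy_prep_clone(struct toy_clone_rq *dst,
			  const struct toy_clone_rq *src,
			  int (*ctr)(struct toy_bio *, const struct toy_bio *, void *),
			  void *data)
{
	const struct toy_bio *bio_src;
	struct toy_bio *bio = NULL;

	assert(dst && src);
	dst->bio = dst->biotail = NULL;

	for (bio_src = src->bio; bio_src; bio_src = bio_src->next) {
		bio = malloc(sizeof(*bio));
		if (!bio)
			goto free_and_out;
		bio->next = NULL;
		bio->payload = bio_src->payload;

		if (ctr && ctr(bio, bio_src, data))
			goto free_and_out;

		if (dst->bio) {
			dst->biotail->next = bio;
			dst->biotail = bio;
		} else {
			dst->bio = dst->biotail = bio;
		}
	}
	return 0;

free_and_out:
	free(bio);             /* this node was never linked; free(NULL) is a no-op */
	toy_unprep_clone(dst); /* release the clones that were already linked */
	return -ENOMEM;
}
```

The design point the sketch tries to show is that the node being processed when the failure occurs is not yet on the chain, so it has to be released separately before the chain itself is unwound.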
2014-04-08 19:15:35 +04:00
int kblockd_schedule_work ( struct work_struct * work )
2005-04-17 02:20:36 +04:00
{
return queue_work ( kblockd_workqueue , work ) ;
}
EXPORT_SYMBOL ( kblockd_schedule_work ) ;
2014-04-08 19:15:35 +04:00
int kblockd_schedule_delayed_work ( struct delayed_work * dwork ,
unsigned long delay )
2010-09-16 01:06:35 +04:00
{
return queue_delayed_work ( kblockd_workqueue , dwork , delay ) ;
}
EXPORT_SYMBOL ( kblockd_schedule_delayed_work ) ;
2014-04-08 19:17:40 +04:00
int kblockd_schedule_delayed_work_on ( int cpu , struct delayed_work * dwork ,
unsigned long delay )
{
return queue_delayed_work_on ( cpu , kblockd_workqueue , dwork , delay ) ;
}
EXPORT_SYMBOL ( kblockd_schedule_delayed_work_on ) ;
2011-09-21 12:00:16 +04:00
/**
* blk_start_plug - initialize blk_plug and track it inside the task_struct
* @ plug : The & struct blk_plug that needs to be initialized
*
* Description :
* Tracking blk_plug inside the task_struct will help with auto - flushing the
* pending I / O should the task end up blocking between blk_start_plug ( ) and
* blk_finish_plug ( ) . This is important from a performance perspective , but
* also ensures that we don ' t deadlock . For instance , if the task is blocking
* for a memory allocation , memory reclaim could end up wanting to free a
* page belonging to that request that is currently residing in our private
* plug . By flushing the pending I / O when the process goes to sleep , we avoid
* this kind of deadlock .
*/
2011-03-08 15:19:51 +03:00
void blk_start_plug ( struct blk_plug * plug )
{
struct task_struct * tsk = current ;
INIT_LIST_HEAD ( & plug - > list ) ;
2013-10-24 12:20:05 +04:00
INIT_LIST_HEAD ( & plug - > mq_list ) ;
2011-04-18 11:52:22 +04:00
INIT_LIST_HEAD ( & plug - > cb_list ) ;
2011-03-08 15:19:51 +03:00
/*
* If this is a nested plug , don ' t actually assign it . It will be
* flushed on its own .
*/
if ( ! tsk - > plug ) {
/*
* Store ordering should not be needed here , since a potential
* preempt will imply a full memory barrier
*/
tsk - > plug = plug ;
}
}
EXPORT_SYMBOL ( blk_start_plug ) ;
static int plug_rq_cmp ( void * priv , struct list_head * a , struct list_head * b )
{
struct request * rqa = container_of ( a , struct request , queuelist ) ;
struct request * rqb = container_of ( b , struct request , queuelist ) ;
block: Add blk_rq_pos(rq) to sort rq when flushing
My workload is a raid5 array with 16 disks, written to through our
filesystem in direct-io mode.
Using blktrace, I found these messages:
8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
8,16 0 0 2.453853661 0 m N cfq2579 insert_request
8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
8,16 0 0 2.453854439 0 m N cfq2579 insert_request
8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
8,16 0 0 2.454795160 0 m N cfq schedule dispatch
From the messages above, we can see that rq[W 7493144 + 104] and rq[W
7493120 + 24] were not merged.
This is because the bios arrived in this order:
8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
bio(7493144) came first and bio(7493120) later, so the subsequent
bios were divided into two requests.
When the plug list is flushed, elv_attempt_insert_merge only supports
back merges, not front merges.
So rq[7493120 + 24] can't merge with rq[7493144 + 104].
In my tests, this situation accounted for 25% of cases in our system.
With this patch, it no longer occurs.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
CC:Shaohua Li <shli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-25 23:58:17 +04:00
return ! ( rqa - > q < rqb - > q | |
( rqa - > q = = rqb - > q & & blk_rq_pos ( rqa ) < blk_rq_pos ( rqb ) ) ) ;
2011-03-08 15:19:51 +03:00
}
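The effect of this ordering can be illustrated outside the kernel. What follows is a minimal userspace sketch, not kernel code: struct toy_rq, toy_rq_cmp() and toy_back_merge() are hypothetical names, a queue is reduced to an integer id, and merging is modelled as back merge only, as the commit message above describes. The comparator is three-way because qsort() requires it, whereas plug_rq_cmp() only reports whether two entries are out of order. With the two requests from the trace, sorting by start sector places the request at 7493120 first, so the one at 7493144 can be appended to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct request: queue id, start sector, length. */
struct toy_rq {
	int q;
	unsigned long long pos;
	unsigned int sectors;
};

/* Order by queue first, then by start sector (the intent of plug_rq_cmp()). */
static int toy_rq_cmp(const void *a, const void *b)
{
	const struct toy_rq *rqa = a;
	const struct toy_rq *rqb = b;

	if (rqa->q != rqb->q)
		return rqa->q < rqb->q ? -1 : 1;
	if (rqa->pos != rqb->pos)
		return rqa->pos < rqb->pos ? -1 : 1;
	return 0;
}

/*
 * Sort, then fold each request into its predecessor when it starts exactly
 * where the predecessor ends on the same queue (back merge only).
 * Returns the number of requests left after merging.
 */
static size_t toy_back_merge(struct toy_rq *rqs, size_t n)
{
	size_t out = 0;
	size_t i;

	assert(n > 0);
	qsort(rqs, n, sizeof(*rqs), toy_rq_cmp);

	for (i = 1; i < n; i++) {
		struct toy_rq *prev = &rqs[out];

		if (rqs[i].q == prev->q &&
		    rqs[i].pos == prev->pos + prev->sectors)
			prev->sectors += rqs[i].sectors;
		else
			rqs[++out] = rqs[i];
	}
	return out + 1;
}
```

Calling toy_back_merge() on the pair { 7493144 + 104, 7493120 + 24 } leaves a single request, because 7493120 + 24 ends exactly at sector 7493144. Without the sort, a back-merge-only pass would see the higher sector first and keep the two requests apart.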
2011-04-16 15:51:05 +04:00
/*
* If ' from_schedule ' is true , then postpone the dispatch of requests
* until a safe kblockd context . We do this to avoid accidental big
* additional stack usage in driver dispatch , in places where the original
* plugger did not intend it .
*/
2011-04-15 17:49:07 +04:00
static void queue_unplugged ( struct request_queue * q , unsigned int depth ,
2011-04-16 15:51:05 +04:00
bool from_schedule )
2011-04-18 11:59:55 +04:00
__releases ( q - > queue_lock )
2011-04-12 12:12:19 +04:00
{
2011-04-16 15:51:05 +04:00
trace_block_unplug ( q , depth , ! from_schedule ) ;
2011-04-18 11:59:55 +04:00
2012-11-28 16:45:56 +04:00
if ( from_schedule )
2011-04-18 13:41:33 +04:00
blk_run_queue_async ( q ) ;
2012-11-28 16:45:56 +04:00
else
2011-04-18 13:41:33 +04:00
__blk_run_queue ( q ) ;
2012-11-28 16:45:56 +04:00
spin_unlock ( q - > queue_lock ) ;
2011-04-12 12:12:19 +04:00
}
2012-07-31 11:08:15 +04:00
static void flush_plug_callbacks ( struct blk_plug * plug , bool from_schedule )
2011-04-18 11:52:22 +04:00
{
LIST_HEAD ( callbacks ) ;
2012-07-31 11:08:15 +04:00
while ( ! list_empty ( & plug - > cb_list ) ) {
list_splice_init ( & plug - > cb_list , & callbacks ) ;
2011-04-18 11:52:22 +04:00
2012-07-31 11:08:15 +04:00
while ( ! list_empty ( & callbacks ) ) {
struct blk_plug_cb * cb = list_first_entry ( & callbacks ,
2011-04-18 11:52:22 +04:00
struct blk_plug_cb ,
list ) ;
2012-07-31 11:08:15 +04:00
list_del ( & cb - > list ) ;
2012-07-31 11:08:15 +04:00
cb - > callback ( cb , from_schedule ) ;
2012-07-31 11:08:15 +04:00
}
2011-04-18 11:52:22 +04:00
}
}
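The nested loops in flush_plug_callbacks() can look redundant at first. The outer loop is there because a callback may queue further callbacks on the plug while it runs. The following is a minimal userspace sketch of that drain pattern, under stated assumptions: struct toy_cb, toy_cb_add(), toy_run_and_spawn() and toy_flush_callbacks() are hypothetical names, and a singly linked list stands in for the kernel's list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback node; the kernel uses struct blk_plug_cb. */
struct toy_cb {
	struct toy_cb *next;
	void (*fn)(struct toy_cb *cb, struct toy_cb **pending);
	struct toy_cb *spawn;	/* optional follow-up work to queue */
	int ran;
};

/* Push a callback onto the pending list. */
static void toy_cb_add(struct toy_cb **pending, struct toy_cb *cb)
{
	assert(cb != NULL);
	cb->next = *pending;
	*pending = cb;
}

/* Example callback: queues its follow-up work, if it has any. */
static void toy_run_and_spawn(struct toy_cb *cb, struct toy_cb **pending)
{
	if (cb->spawn)
		toy_cb_add(pending, cb->spawn);
}

/*
 * Detach the whole pending list, run every detached callback, and repeat
 * for as long as callbacks queued new work. Returns how many callbacks ran.
 */
static int toy_flush_callbacks(struct toy_cb **pending)
{
	int count = 0;

	while (*pending) {
		struct toy_cb *batch = *pending;

		*pending = NULL;		/* like list_splice_init() */
		while (batch) {
			struct toy_cb *cb = batch;

			batch = cb->next;	/* like list_del() */
			cb->next = NULL;
			cb->ran++;
			cb->fn(cb, pending);
			count++;
		}
	}
	return count;
}
```

Detaching the list before running it means a callback never modifies the list that is being walked. Anything it adds lands on the now-empty pending list and is picked up by the next pass of the outer loop.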
2012-07-31 11:08:14 +04:00
struct blk_plug_cb * blk_check_plugged ( blk_plug_cb_fn unplug , void * data ,
int size )
{
struct blk_plug * plug = current - > plug ;
struct blk_plug_cb * cb ;
if ( ! plug )
return NULL ;
list_for_each_entry ( cb , & plug - > cb_list , list )
if ( cb - > callback = = unplug & & cb - > data = = data )
return cb ;
/* Not currently on the callback list */
BUG_ON ( size < sizeof ( * cb ) ) ;
cb = kzalloc ( size , GFP_ATOMIC ) ;
if ( cb ) {
cb - > data = data ;
cb - > callback = unplug ;
list_add ( & cb - > list , & plug - > cb_list ) ;
}
return cb ;
}
EXPORT_SYMBOL ( blk_check_plugged ) ;
2011-04-16 15:51:05 +04:00
void blk_flush_plug_list ( struct blk_plug * plug , bool from_schedule )
2011-03-08 15:19:51 +03:00
{
struct request_queue * q ;
unsigned long flags ;
struct request * rq ;
2011-04-11 16:13:10 +04:00
LIST_HEAD ( list ) ;
2011-04-12 12:12:19 +04:00
unsigned int depth ;
2011-03-08 15:19:51 +03:00
2012-07-31 11:08:15 +04:00
flush_plug_callbacks ( plug , from_schedule ) ;
2013-10-24 12:20:05 +04:00
if ( ! list_empty ( & plug - > mq_list ) )
blk_mq_flush_plug_list ( plug , from_schedule ) ;
2011-03-08 15:19:51 +03:00
if ( list_empty ( & plug - > list ) )
return ;
2011-04-11 16:13:10 +04:00
list_splice_init ( & plug - > list , & list ) ;
2013-01-11 17:46:09 +04:00
list_sort ( NULL , & list , plug_rq_cmp ) ;
2011-03-08 15:19:51 +03:00
q = NULL ;
2011-04-12 12:12:19 +04:00
depth = 0 ;
2011-04-12 12:11:24 +04:00
/*
* Save and disable interrupts here , to avoid doing it for every
* queue lock we have to take .
*/
2011-03-08 15:19:51 +03:00
local_irq_save ( flags ) ;
2011-04-11 16:13:10 +04:00
while ( ! list_empty ( & list ) ) {
rq = list_entry_rq ( list . next ) ;
2011-03-08 15:19:51 +03:00
list_del_init ( & rq - > queuelist ) ;
BUG_ON ( ! rq - > q ) ;
if ( rq - > q ! = q ) {
2011-04-18 11:59:55 +04:00
/*
* This drops the queue lock
*/
if ( q )
2011-04-16 15:51:05 +04:00
queue_unplugged ( q , depth , from_schedule ) ;
2011-03-08 15:19:51 +03:00
q = rq - > q ;
2011-04-12 12:12:19 +04:00
depth = 0 ;
2011-03-08 15:19:51 +03:00
spin_lock ( q - > queue_lock ) ;
}
2011-12-14 03:33:37 +04:00
/*
* Short - circuit if @ q is dead
*/
2012-11-28 16:42:38 +04:00
if ( unlikely ( blk_queue_dying ( q ) ) ) {
2011-12-14 03:33:37 +04:00
__blk_end_request_all ( rq , - ENODEV ) ;
continue ;
}
2011-03-08 15:19:51 +03:00
/*
* rq is already accounted , so use raw insert
*/
2011-03-25 18:57:52 +03:00
if ( rq - > cmd_flags & ( REQ_FLUSH | REQ_FUA ) )
__elv_add_request ( q , rq , ELEVATOR_INSERT_FLUSH ) ;
else
__elv_add_request ( q , rq , ELEVATOR_INSERT_SORT_MERGE ) ;
2011-04-12 12:12:19 +04:00
depth + + ;
2011-03-08 15:19:51 +03:00
}
2011-04-18 11:59:55 +04:00
/*
* This drops the queue lock
*/
if ( q )
2011-04-16 15:51:05 +04:00
queue_unplugged ( q , depth , from_schedule ) ;
2011-03-08 15:19:51 +03:00
local_irq_restore ( flags ) ;
}
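The main loop of blk_flush_plug_list() relies on the list being grouped by queue: it takes a queue's lock when the queue changes, counts the requests it inserts, and unplugs the previous queue with that count. A minimal userspace sketch of that batching follows. It is an illustration under stated assumptions, not the kernel's code: struct toy_unplug and toy_flush() are hypothetical names, queues are reduced to integer ids, and locking, the dying-queue check and the flush/FUA distinction are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one unplug event: which queue, how many requests. */
struct toy_unplug {
	int q;
	unsigned int depth;
};

/*
 * Walk queue ids that are already grouped by queue (as after list_sort()
 * with plug_rq_cmp()) and emit one unplug record per run of equal ids.
 * 'out' must have room for n entries. Returns the number of records.
 */
static size_t toy_flush(const int *rq_queue, size_t n, struct toy_unplug *out)
{
	size_t nr = 0;
	size_t i;
	int have_q = 0;
	int q = 0;
	unsigned int depth = 0;

	assert(n == 0 || (rq_queue != NULL && out != NULL));

	for (i = 0; i < n; i++) {
		if (!have_q || rq_queue[i] != q) {
			if (have_q) {
				/* like queue_unplugged(q, depth, ...) */
				out[nr].q = q;
				out[nr].depth = depth;
				nr++;
			}
			q = rq_queue[i];
			have_q = 1;
			depth = 0;
		}
		depth++;	/* request handed to q's elevator */
	}
	if (have_q) {
		/* final unplug for the last queue seen */
		out[nr].q = q;
		out[nr].depth = depth;
		nr++;
	}
	return nr;
}
```

For the queue ids { 1, 1, 1, 2, 2, 5 } this produces three unplug records with depths 3, 2 and 1. If the input were not grouped, the same queue would be locked and unplugged several times, which is the cost the sort avoids.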
void blk_finish_plug ( struct blk_plug * plug )
{
2011-04-15 17:49:07 +04:00
blk_flush_plug_list ( plug , false ) ;
2011-03-08 15:19:51 +03:00
2011-04-15 17:20:10 +04:00
if ( plug = = current - > plug )
current - > plug = NULL ;
2011-03-08 15:19:51 +03:00
}
2011-04-15 17:20:10 +04:00
EXPORT_SYMBOL ( blk_finish_plug ) ;
2011-03-08 15:19:51 +03:00
2013-03-23 07:42:26 +04:00
# ifdef CONFIG_PM_RUNTIME
/**
* blk_pm_runtime_init - Block layer runtime PM initialization routine
* @ q : the queue of the device
* @ dev : the device the queue belongs to
*
* Description :
* Initialize runtime - PM - related fields for @ q and start auto suspend for
* @ dev . Drivers that want to take advantage of request - based runtime PM
* should call this function after @ dev has been initialized and its
* request queue @ q has been allocated , while runtime PM for it can not yet
* happen ( either because it is disabled / forbidden or because its
* usage_count > 0 ) . In most cases , drivers should call this function
* before any I / O has taken place .
*
* This function sets up auto suspend for the device . The autosuspend
* delay is set to - 1 to make runtime suspend impossible until an updated
* value is set by either the user or the driver . Drivers do not need to
* touch other autosuspend settings .
*
* The block layer runtime PM is request based , so it only works for drivers
* that use requests as their IO unit , not for those that use bios directly .
*/
void blk_pm_runtime_init ( struct request_queue * q , struct device * dev )
{
q - > dev = dev ;
q - > rpm_status = RPM_ACTIVE ;
pm_runtime_set_autosuspend_delay ( q - > dev , - 1 ) ;
pm_runtime_use_autosuspend ( q - > dev ) ;
}
EXPORT_SYMBOL ( blk_pm_runtime_init ) ;
/**
* blk_pre_runtime_suspend - Pre runtime suspend check
* @ q : the queue of the device
*
* Description :
* This function will check if runtime suspend is allowed for the device
* by examining if there are any requests pending in the queue . If there
* are requests pending , the device can not be runtime suspended ; otherwise ,
* the queue ' s status will be updated to SUSPENDING and the driver can
* proceed to suspend the device .
*
* If suspend is not allowed , we mark last busy for the device so that the
* runtime PM core will try to autosuspend it some time later .
*
* This function should be called near the start of the device ' s
* runtime_suspend callback .
*
* Return :
* 0 - OK to runtime suspend the device
* - EBUSY - Device should not be runtime suspended
*/
int blk_pre_runtime_suspend ( struct request_queue * q )
{
int ret = 0 ;
spin_lock_irq ( q - > queue_lock ) ;
if ( q - > nr_pending ) {
ret = - EBUSY ;
pm_runtime_mark_last_busy ( q - > dev ) ;
} else {
q - > rpm_status = RPM_SUSPENDING ;
}
spin_unlock_irq ( q - > queue_lock ) ;
return ret ;
}
EXPORT_SYMBOL ( blk_pre_runtime_suspend ) ;
/**
* blk_post_runtime_suspend - Post runtime suspend processing
* @ q : the queue of the device
* @ err : return value of the device ' s runtime_suspend function
*
* Description :
* Update the queue ' s runtime status according to the return value of the
* device ' s runtime suspend function and mark last busy for the device so
* that PM core will try to auto suspend the device at a later time .
*
* This function should be called near the end of the device ' s
* runtime_suspend callback .
*/
void blk_post_runtime_suspend ( struct request_queue * q , int err )
{
spin_lock_irq ( q - > queue_lock ) ;
if ( ! err ) {
q - > rpm_status = RPM_SUSPENDED ;
} else {
q - > rpm_status = RPM_ACTIVE ;
pm_runtime_mark_last_busy ( q - > dev ) ;
}
spin_unlock_irq ( q - > queue_lock ) ;
}
EXPORT_SYMBOL ( blk_post_runtime_suspend ) ;
/**
* blk_pre_runtime_resume - Pre runtime resume processing
* @ q : the queue of the device
*
* Description :
* Update the queue ' s runtime status to RESUMING in preparation for the
* runtime resume of the device .
*
* This function should be called near the start of the device ' s
* runtime_resume callback .
*/
void blk_pre_runtime_resume ( struct request_queue * q )
{
spin_lock_irq ( q - > queue_lock ) ;
q - > rpm_status = RPM_RESUMING ;
spin_unlock_irq ( q - > queue_lock ) ;
}
EXPORT_SYMBOL ( blk_pre_runtime_resume ) ;
/**
* blk_post_runtime_resume - Post runtime resume processing
* @ q : the queue of the device
* @ err : return value of the device ' s runtime_resume function
*
* Description :
* Update the queue ' s runtime status according to the return value of the
* device ' s runtime_resume function . If it is successfully resumed , process
* the requests that were queued into the device ' s queue while it was
* resuming , then mark last busy and initiate autosuspend for it .
*
* This function should be called near the end of the device ' s
* runtime_resume callback .
*/
void blk_post_runtime_resume ( struct request_queue * q , int err )
{
spin_lock_irq ( q - > queue_lock ) ;
if ( ! err ) {
q - > rpm_status = RPM_ACTIVE ;
__blk_run_queue ( q ) ;
pm_runtime_mark_last_busy ( q - > dev ) ;
2013-05-17 11:47:20 +04:00
pm_request_autosuspend ( q - > dev ) ;
2013-03-23 07:42:26 +04:00
} else {
q - > rpm_status = RPM_SUSPENDED ;
}
spin_unlock_irq ( q - > queue_lock ) ;
}
EXPORT_SYMBOL ( blk_post_runtime_resume ) ;
# endif
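Taken together, the four helpers above move q->rpm_status through a small state machine: ACTIVE to SUSPENDING to SUSPENDED on the way down, and SUSPENDED to RESUMING to ACTIVE on the way back, with a failed step falling back to the previous stable state. The following userspace sketch models only those transitions. The toy_ names are hypothetical, -1 stands in for -EBUSY, and locking, queue running and the pm_runtime calls are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the four rpm_status values used above. */
enum toy_rpm { TOY_ACTIVE, TOY_SUSPENDING, TOY_SUSPENDED, TOY_RESUMING };

struct toy_queue {
	enum toy_rpm status;
	unsigned int nr_pending;
};

/* Returns 0 if suspend may proceed, -1 (standing in for -EBUSY) otherwise. */
static int toy_pre_suspend(struct toy_queue *q)
{
	assert(q != NULL);
	if (q->nr_pending)
		return -1;		/* requests pending: stay ACTIVE */
	q->status = TOY_SUSPENDING;
	return 0;
}

/* A failed suspend returns the queue to ACTIVE. */
static void toy_post_suspend(struct toy_queue *q, int err)
{
	q->status = err ? TOY_ACTIVE : TOY_SUSPENDED;
}

static void toy_pre_resume(struct toy_queue *q)
{
	q->status = TOY_RESUMING;
}

/* A failed resume leaves the queue SUSPENDED. */
static void toy_post_resume(struct toy_queue *q, int err)
{
	q->status = err ? TOY_SUSPENDED : TOY_ACTIVE;
}
```

A driver's runtime_suspend callback would bracket its hardware suspend with the pre and post suspend helpers, and its runtime_resume callback would do the same with the resume pair, as the kernel-doc comments above describe.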
2005-04-17 02:20:36 +04:00
int __init blk_dev_init ( void )
{
2009-04-27 16:53:54 +04:00
BUILD_BUG_ON ( __REQ_NR_BITS > 8 *
sizeof ( ( ( struct request * ) 0 ) - > cmd_flags ) ) ;
2011-01-03 17:01:47 +03:00
/* used for unplugging and affects IO latency/throughput - HIGHPRI */
kblockd_workqueue = alloc_workqueue ( " kblockd " ,
2014-06-12 01:43:54 +04:00
WQ_MEM_RECLAIM | WQ_HIGHPRI , 0 ) ;
2005-04-17 02:20:36 +04:00
if ( ! kblockd_workqueue )
panic ( " Failed to create kblockd \n " ) ;
request_cachep = kmem_cache_create ( " blkdev_requests " ,
2007-07-20 05:11:58 +04:00
sizeof ( struct request ) , 0 , SLAB_PANIC , NULL ) ;
2005-04-17 02:20:36 +04:00
2008-01-29 16:51:59 +03:00
blk_requestq_cachep = kmem_cache_create ( " blkdev_queue " ,
2007-07-24 11:28:11 +04:00
sizeof ( struct request_queue ) , 0 , SLAB_PANIC , NULL ) ;
2005-04-17 02:20:36 +04:00
2008-01-24 10:53:35 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}