2008-01-29 14:51:59 +01:00
# ifndef BLK_INTERNAL_H
# define BLK_INTERNAL_H
2011-12-14 00:33:37 +01:00
# include <linux/idr.h>
2014-09-25 23:23:47 +08:00
# include <linux/blk-mq.h>
# include "blk-mq.h"
2011-12-14 00:33:37 +01:00
2008-01-29 14:53:40 +01:00
/* Amount of time in which a process may batch requests */
# define BLK_BATCH_TIME (HZ / 50UL)
/* Number of requests a "batching" process may submit */
# define BLK_BATCH_REQ 32
2014-05-13 15:10:52 -06:00
/* Max future timer expiry for timeouts */
# define BLK_MAX_TIMEOUT (5 * HZ)
2017-01-31 14:53:20 -08:00
# ifdef CONFIG_DEBUG_FS
extern struct dentry * blk_debugfs_root ;
# endif
2014-09-25 23:23:43 +08:00
struct blk_flush_queue {
unsigned int flush_queue_delayed : 1 ;
unsigned int flush_pending_idx : 1 ;
unsigned int flush_running_idx : 1 ;
unsigned long flush_pending_since ;
struct list_head flush_queue [ 2 ] ;
struct list_head flush_data_in_flight ;
struct request * flush_rq ;
2015-08-09 03:41:51 -04:00
/*
* flush_rq shares tag with this rq, both can't be active
* at the same time
*/
struct request * orig_rq ;
2014-09-25 23:23:43 +08:00
spinlock_t mq_flush_lock ;
} ;
2008-01-29 14:51:59 +01:00
extern struct kmem_cache * blk_requestq_cachep ;
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
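The 1:1 versus N:M relationship between per-cpu queues and hardware submission queues described in the message above can be illustrated with a small standalone sketch. This is not the kernel's mapping code: the helper name sketch_map_cpu_to_hwq and the "spread CPUs evenly" policy are assumptions chosen only for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel API): choose a hardware submission
 * queue for the software queue owned by @cpu. The even-spread policy
 * is an assumption made for this sketch.
 */
static unsigned int sketch_map_cpu_to_hwq(unsigned int cpu,
					  unsigned int nr_cpus,
					  unsigned int nr_hw_queues)
{
	assert(nr_cpus > 0 && nr_hw_queues > 0 && cpu < nr_cpus);

	/* 1:1 case: the hardware offers at least one queue per CPU */
	if (nr_hw_queues >= nr_cpus)
		return cpu;

	/* N:M case: contiguous groups of CPUs share one hardware queue */
	return (unsigned int)((unsigned long long)cpu * nr_hw_queues / nr_cpus);
}
```

With 8 CPUs and 2 hardware queues, this sketch sends CPUs 0-3 to queue 0 and CPUs 4-7 to queue 1. With 4 CPUs and 4 or more hardware queues, each CPU gets its own queue.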
2013-10-24 09:20:05 +01:00
extern struct kmem_cache * request_cachep ;
2008-01-29 14:51:59 +01:00
extern struct kobj_type blk_queue_ktype ;
2011-12-14 00:33:37 +01:00
extern struct ida blk_queue_ida ;
2008-01-29 14:51:59 +01:00
2014-09-25 23:23:43 +08:00
static inline struct blk_flush_queue * blk_get_flush_queue (
2014-09-25 23:23:46 +08:00
struct request_queue * q , struct blk_mq_ctx * ctx )
2014-09-25 23:23:43 +08:00
{
2016-09-14 16:18:54 +02:00
if ( q->mq_ops )
return blk_mq_map_queue ( q , ctx->cpu )->fq ;
return q->fq ;
2014-09-25 23:23:43 +08:00
}
2011-12-14 00:33:38 +01:00
static inline void __blk_get_queue ( struct request_queue * q )
{
kobject_get ( &q->kobj ) ;
}
2014-09-25 23:23:47 +08:00
struct blk_flush_queue * blk_alloc_flush_queue ( struct request_queue * q ,
int node , int cmd_size ) ;
void blk_free_flush_queue ( struct blk_flush_queue * q ) ;
2014-09-25 23:23:40 +08:00
2012-06-04 20:40:59 -07:00
int blk_init_rl ( struct request_list * rl , struct request_queue * q ,
gfp_t gfp_mask ) ;
void blk_exit_rl ( struct request_list * rl ) ;
2008-01-29 14:53:40 +01:00
void blk_rq_bio_prep ( struct request_queue * q , struct request * rq ,
struct bio * bio ) ;
2012-03-05 13:14:58 -08:00
void blk_queue_bypass_start ( struct request_queue * q ) ;
void blk_queue_bypass_end ( struct request_queue * q ) ;
2009-05-08 11:54:16 +09:00
void blk_dequeue_request ( struct request * rq ) ;
2008-01-29 14:51:59 +01:00
void __blk_queue_free_tags ( struct request_queue * q ) ;
2015-10-21 13:20:12 -04:00
void blk_freeze_queue ( struct request_queue * q ) ;
static inline void blk_queue_enter_live ( struct request_queue * q )
{
/*
* Given that running in generic_make_request() context
* guarantees that a live reference against q_usage_counter has
* been established, further references under that same context
* need not check that the queue has been frozen (marked dead).
*/
percpu_ref_get ( &q->q_usage_counter ) ;
}
2008-01-29 14:51:59 +01:00
2015-10-21 13:20:23 -04:00
# ifdef CONFIG_BLK_DEV_INTEGRITY
void blk_flush_integrity ( void ) ;
# else
static inline void blk_flush_integrity ( void )
{
}
# endif
2008-01-29 14:51:59 +01:00
2015-10-30 20:57:30 +08:00
void blk_timeout_work ( struct work_struct * work ) ;
2014-05-13 15:10:52 -06:00
unsigned long blk_rq_timeout ( unsigned long timeout ) ;
2014-04-24 08:51:47 -06:00
void blk_add_timer ( struct request * req ) ;
2008-09-14 05:55:09 -07:00
void blk_delete_timer ( struct request * ) ;
2013-10-24 09:20:05 +01:00
bool bio_attempt_front_merge ( struct request_queue * q , struct request * req ,
struct bio * bio ) ;
bool bio_attempt_back_merge ( struct request_queue * q , struct request * req ,
struct bio * bio ) ;
2017-02-08 14:46:49 +01:00
bool bio_attempt_discard_merge ( struct request_queue * q , struct request * req ,
struct bio * bio ) ;
2013-10-24 09:20:05 +01:00
bool blk_attempt_plug_merge ( struct request_queue * q , struct bio * bio ,
2015-05-08 10:51:33 -07:00
unsigned int * request_count ,
struct request * * same_queue_rq ) ;
2015-10-20 23:13:51 +08:00
unsigned int blk_plug_queued_count ( struct request_queue * q ) ;
2013-10-24 09:20:05 +01:00
void blk_account_io_start ( struct request * req , bool new_io ) ;
void blk_account_io_completion ( struct request * req , unsigned int bytes ) ;
void blk_account_io_done ( struct request * req ) ;
2008-09-14 05:55:09 -07:00
/*
* Internal atomic flags for request handling
*/
enum rq_atomic_flags {
REQ_ATOM_COMPLETE = 0 ,
2013-10-24 09:20:05 +01:00
REQ_ATOM_STARTED ,
2016-11-14 13:01:59 -07:00
REQ_ATOM_POLL_SLEPT ,
2008-09-14 05:55:09 -07:00
} ;
/*
* EH timer and IO completion will both attempt to 'grab' the request, make
2011-03-30 22:57:33 -03:00
* sure that only one of them succeeds
2008-09-14 05:55:09 -07:00
*/
static inline int blk_mark_rq_complete ( struct request * rq )
{
return test_and_set_bit ( REQ_ATOM_COMPLETE , &rq->atomic_flags ) ;
}
static inline void blk_clear_rq_complete ( struct request * rq )
{
clear_bit ( REQ_ATOM_COMPLETE , &rq->atomic_flags ) ;
}
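The "only one of them succeeds" rule for the timeout handler and the completion path can be sketched in portable C11. This is a userspace analogy, not the kernel's bitops: struct sketch_request and the sketch_* helpers are hypothetical names, and atomic_exchange() stands in for test_and_set_bit(). Unlike blk_mark_rq_complete(), which returns nonzero when the bit was already set, sketch_claim() returns true to the caller that wins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request carrying a "complete" bit */
struct sketch_request {
	atomic_bool complete;
};

static void sketch_init(struct sketch_request *rq)
{
	atomic_init(&rq->complete, false);
}

/*
 * Atomically set the flag and report whether this caller was the one
 * that flipped it. Exactly one of several racing callers sees true.
 */
static bool sketch_claim(struct sketch_request *rq)
{
	return !atomic_exchange(&rq->complete, true);
}

/* Release the claim so the request can be grabbed again later */
static void sketch_unclaim(struct sketch_request *rq)
{
	atomic_store(&rq->complete, false);
}
```

In this sketch, whichever path calls sketch_claim() first owns the request; the loser sees false and must leave the request alone.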
2008-01-29 14:53:40 +01:00
2009-04-23 11:05:18 +09:00
/*
* Internal elevator interface
*/
2016-10-20 15:12:13 +02:00
# define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
2009-04-23 11:05:18 +09:00
2011-01-25 12:43:54 +01:00
void blk_insert_flush ( struct request * rq ) ;
2010-09-03 11:56:16 +02:00
2009-04-23 11:05:18 +09:00
static inline struct request * __elv_next_request ( struct request_queue * q )
{
struct request * rq ;
2014-09-25 23:23:46 +08:00
struct blk_flush_queue * fq = blk_get_flush_queue ( q , NULL ) ;
2009-04-23 11:05:18 +09:00
while ( 1 ) {
2011-01-25 12:43:54 +01:00
if ( ! list_empty ( &q->queue_head ) ) {
2009-04-23 11:05:18 +09:00
rq = list_entry_rq ( q->queue_head.next ) ;
2011-01-25 12:43:54 +01:00
return rq ;
2009-04-23 11:05:18 +09:00
}
block: hold queue if flush is running for non-queueable flush drive
In some drives, flush requests are non-queueable. When flush request is
running, normal read/write requests can't run. If block layer dispatches
such request, driver can't handle it and requeue it. Tejun suggested we
can hold the queue when flush is running. This can avoid unnecessary
requeue. Also this can improve performance. For example, we have
request flush1, write1, flush 2. flush1 is dispatched, then queue is
hold, write1 isn't inserted to queue. After flush1 is finished, flush2
will be dispatched. Since disk cache is already clean, flush2 will be
finished very soon, so looks like flush2 is folded to flush1.
In my test, the queue holding completely solves a regression introduced by
commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:
block: make the flush insertion use the tail of the dispatch list
It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.
which causes about 20% regression running a sysbench fileio
workload.
Stable: 2.6.39 only
Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06 11:34:41 -06:00
/*
* Flush request is running and flush request isn't queueable
* in the drive, so we can hold the queue until the flush request
* is finished. Even if we don't do this, the driver can't dispatch
* the next requests and will requeue them. Holding the queue can
* improve throughput too. For example, we have requests flush1,
* write1, flush2. flush1 is dispatched, then the queue is held and
* write1 isn't inserted into the queue. After flush1 is finished,
* flush2 will be dispatched. Since the disk cache is already clean,
* flush2 will finish very soon, so it looks like flush2 is
* folded into flush1.
* Since the queue is held, a flag is set to indicate the queue
* should be restarted later. Please see flush_end_io() for
* details.
*/
2014-09-25 23:23:43 +08:00
if ( fq->flush_pending_idx != fq->flush_running_idx &&
2011-05-06 11:34:41 -06:00
! queue_flush_queueable ( q ) ) {
2014-09-25 23:23:43 +08:00
fq->flush_queue_delayed = 1 ;
2011-05-06 11:34:41 -06:00
return NULL ;
}
2014-01-29 14:56:16 -07:00
if ( unlikely ( blk_queue_bypass ( q ) ) ||
2016-12-10 15:13:59 -07:00
! q->elevator->type->ops.sq.elevator_dispatch_fn ( q , 0 ) )
2009-04-23 11:05:18 +09:00
return NULL ;
}
}
static inline void elv_activate_rq ( struct request_queue * q , struct request * rq )
{
struct elevator_queue * e = q->elevator ;
2016-12-10 15:13:59 -07:00
if ( e->type->ops.sq.elevator_activate_req_fn )
e->type->ops.sq.elevator_activate_req_fn ( q , rq ) ;
2009-04-23 11:05:18 +09:00
}
static inline void elv_deactivate_rq ( struct request_queue * q , struct request * rq )
{
struct elevator_queue * e = q->elevator ;
2016-12-10 15:13:59 -07:00
if ( e->type->ops.sq.elevator_deactivate_req_fn )
e->type->ops.sq.elevator_deactivate_req_fn ( q , rq ) ;
2009-04-23 11:05:18 +09:00
}
2008-09-14 05:56:33 -07:00
# ifdef CONFIG_FAIL_IO_TIMEOUT
int blk_should_fake_timeout ( struct request_queue * ) ;
ssize_t part_timeout_show ( struct device * , struct device_attribute * , char * ) ;
ssize_t part_timeout_store ( struct device * , struct device_attribute * ,
const char * , size_t ) ;
# else
static inline int blk_should_fake_timeout ( struct request_queue * q )
{
return 0 ;
}
# endif
2008-01-29 14:04:06 +01:00
int ll_back_merge_fn ( struct request_queue * q , struct request * req ,
struct bio * bio ) ;
int ll_front_merge_fn ( struct request_queue * q , struct request * req ,
struct bio * bio ) ;
2017-02-02 08:54:40 -07:00
struct request * attempt_back_merge ( struct request_queue * q , struct request * rq ) ;
struct request * attempt_front_merge ( struct request_queue * q , struct request * rq ) ;
2011-03-21 10:14:27 +01:00
int blk_attempt_req_merge ( struct request_queue * q , struct request * rq ,
struct request * next ) ;
2008-01-29 14:04:06 +01:00
void blk_recalc_rq_segments ( struct request * rq ) ;
2009-07-03 17:48:17 +09:00
void blk_rq_set_mixed_merge ( struct request * rq ) ;
2012-02-08 09:19:38 +01:00
bool blk_rq_merge_ok ( struct request * rq , struct bio * bio ) ;
2017-02-08 14:46:48 +01:00
enum elv_merge blk_try_merge ( struct request * rq , struct bio * bio ) ;
2008-01-29 14:04:06 +01:00
2008-01-29 14:51:59 +01:00
void blk_queue_congestion_threshold ( struct request_queue * q ) ;
2008-03-04 11:23:45 +01:00
int blk_dev_init ( void ) ;
2010-10-24 22:06:02 +02:00
2008-01-29 14:51:59 +01:00
/*
* Return the threshold (number of used requests) at which the queue is
* considered to be congested. It includes a little hysteresis to keep the
* context switch rate down.
*/
static inline int queue_congestion_on_threshold ( struct request_queue * q )
{
return q->nr_congestion_on ;
}
/*
* The threshold at which a queue is considered to be uncongested
*/
static inline int queue_congestion_off_threshold ( struct request_queue * q )
{
return q->nr_congestion_off ;
}
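The hysteresis between the "on" and "off" thresholds can be sketched as a small state machine. This is only an illustration: struct sketch_queue, sketch_update_congestion(), and the exact comparison directions are assumptions, not the block layer's actual accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue state; the field names only mirror the idea above */
struct sketch_queue {
	int on_threshold;	/* plays the role of nr_congestion_on */
	int off_threshold;	/* plays the role of nr_congestion_off */
	bool congested;
};

/*
 * Update the congested state for @used in-flight requests. The state
 * flips to congested at on_threshold and only flips back once usage
 * drops below the lower off_threshold, so small oscillations around
 * a single limit do not toggle it repeatedly.
 */
static bool sketch_update_congestion(struct sketch_queue *q, int used)
{
	assert(q->off_threshold < q->on_threshold);

	if (!q->congested && used >= q->on_threshold)
		q->congested = true;
	else if (q->congested && used < q->off_threshold)
		q->congested = false;

	return q->congested;
}
```

The gap between the two thresholds is what keeps waiters from being woken and put back to sleep on every request that crosses one boundary.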
2014-05-20 11:49:02 -06:00
extern int blk_update_nr_requests ( struct request_queue * , unsigned int ) ;
2009-04-24 08:10:11 +02:00
/*
* Contribute to IO statistics IFF:
*
* a) it's attached to a gendisk, and
* b) the queue had IO stats enabled when this request was started, and
2012-09-18 12:19:25 -04:00
* c) it's a file system request
2009-04-24 08:10:11 +02:00
*/
2009-03-27 10:31:51 +01:00
static inline int blk_do_io_stat ( struct request * rq )
2009-02-02 08:42:32 +01:00
{
2010-08-07 18:17:56 +02:00
return rq->rq_disk &&
2016-10-20 15:12:13 +02:00
( rq->rq_flags & RQF_IO_STAT ) &&
2017-01-31 16:57:29 +01:00
! blk_rq_is_passthrough ( rq ) ;
2009-02-02 08:42:32 +01:00
}
2017-02-08 14:46:47 +01:00
static inline void req_set_nomerge ( struct request_queue * q , struct request * req )
{
req->cmd_flags |= REQ_NOMERGE ;
if ( req == q->last_merge )
q->last_merge = NULL ;
}
2011-12-14 00:33:40 +01:00
/*
* Internal io_context interface
*/
void get_io_context ( struct io_context * ioc ) ;
2011-12-14 00:33:42 +01:00
struct io_cq * ioc_lookup_icq ( struct io_context * ioc , struct request_queue * q ) ;
2012-03-05 13:15:24 -08:00
struct io_cq * ioc_create_icq ( struct io_context * ioc , struct request_queue * q ,
gfp_t gfp_mask ) ;
2011-12-14 00:33:42 +01:00
void ioc_clear_queue ( struct request_queue * q ) ;
2011-12-14 00:33:40 +01:00
2012-03-05 13:15:24 -08:00
int create_task_io_context ( struct task_struct * task , gfp_t gfp_mask , int node ) ;
2011-12-14 00:33:40 +01:00
2016-12-14 14:23:43 -07:00
/**
* rq_ioc - determine io_context for request allocation
* @bio: request being allocated is for this bio (can be %NULL)
*
* Determine io_context to use for request allocation for @bio. May return
* %NULL if %current->io_context doesn't exist.
*/
static inline struct io_context * rq_ioc ( struct bio * bio )
{
# ifdef CONFIG_BLK_CGROUP
if ( bio && bio->bi_ioc )
return bio->bi_ioc ;
# endif
return current->io_context ;
}
2011-12-14 00:33:40 +01:00
/**
* create_io_context - try to create task->io_context
* @gfp_mask: allocation mask
* @node: allocation node
*
2012-03-05 13:15:24 -08:00
* If %current->io_context is %NULL, allocate a new io_context and install
* it. Returns the current %current->io_context which may be %NULL if
* allocation failed.
2011-12-14 00:33:40 +01:00
*
* Note that this function can't be called with IRQ disabled because
2012-03-05 13:15:24 -08:00
* task_lock which protects %current->io_context is IRQ-unsafe.
2011-12-14 00:33:40 +01:00
*/
2012-03-05 13:15:24 -08:00
static inline struct io_context * create_io_context ( gfp_t gfp_mask , int node )
2011-12-14 00:33:40 +01:00
{
WARN_ON_ONCE ( irqs_disabled ( ) ) ;
2012-03-05 13:15:24 -08:00
if ( unlikely ( ! current->io_context ) )
create_task_io_context ( current , gfp_mask , node ) ;
return current->io_context ;
2011-12-14 00:33:40 +01:00
}
/*
* Internal throttling interface
*/
2011-10-19 14:31:18 +02:00
# ifdef CONFIG_BLK_DEV_THROTTLING
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
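The shutdown/release split described in the commit message above can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct toy_queue and every toy_queue_*() helper are hypothetical names, not kernel APIs, and the single-threaded code leaves out the locking (q->exit_mutex, queue_lock) that the real change relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy queue; none of these names exist in the kernel. */
struct toy_queue {
	int refcount;	/* holders keeping the structure alive */
	int pending;	/* requests accepted but not yet completed */
	bool dead;	/* set once by shutdown, never cleared */
};

static struct toy_queue *toy_queue_alloc(void)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	if (q)
		q->refcount = 1;	/* the owner's reference */
	return q;
}

static void toy_queue_get(struct toy_queue *q)
{
	q->refcount++;
}

/* Returns 0 if the request was accepted, -1 if the queue is dead. */
static int toy_queue_submit(struct toy_queue *q)
{
	if (q->dead)
		return -1;	/* fail immediately after shutdown */
	q->pending++;
	return 0;
}

/*
 * Shutdown: drain what is pending and mark the queue dead.
 * The memory stays valid for anyone still holding a reference.
 */
static void toy_queue_shutdown(struct toy_queue *q)
{
	q->pending = 0;		/* stand-in for completing every request */
	q->dead = true;
}

/* Release: drop one reference; returns true if the queue was freed. */
static bool toy_queue_put(struct toy_queue *q)
{
	assert(q->refcount > 0);
	if (--q->refcount > 0)
		return false;
	free(q);
	return true;
}
```

A second holder that took a reference with toy_queue_get() before shutdown can still call toy_queue_submit() afterwards; the call fails with -1 instead of touching freed memory, and the structure is freed only when the last toy_queue_put() runs.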
2011-10-19 14:42:16 +02:00
extern void blk_throtl_drain ( struct request_queue * q ) ;
2011-10-19 14:31:18 +02:00
extern int blk_throtl_init ( struct request_queue * q ) ;
extern void blk_throtl_exit ( struct request_queue * q ) ;
2017-03-27 10:51:38 -07:00
extern void blk_throtl_register_queue ( struct request_queue * q ) ;
2011-10-19 14:31:18 +02:00
# else /* CONFIG_BLK_DEV_THROTTLING */
2011-10-19 14:42:16 +02:00
static inline void blk_throtl_drain ( struct request_queue * q ) { }
2011-10-19 14:31:18 +02:00
static inline int blk_throtl_init ( struct request_queue * q ) { return 0 ; }
static inline void blk_throtl_exit ( struct request_queue * q ) { }
2017-03-27 10:51:38 -07:00
static inline void blk_throtl_register_queue ( struct request_queue * q ) { }
2011-10-19 14:31:18 +02:00
# endif /* CONFIG_BLK_DEV_THROTTLING */
2017-03-27 10:51:37 -07:00
# ifdef CONFIG_BLK_DEV_THROTTLING_LOW
extern ssize_t blk_throtl_sample_time_show ( struct request_queue * q , char * page ) ;
extern ssize_t blk_throtl_sample_time_store ( struct request_queue * q ,
const char * page , size_t count ) ;
blk-throttle: add a simple idle detection
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.
We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truly idle.
Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.
The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, which should not be a big deal.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
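To make the precision loss of the 'ns >> 10' conversion concrete, here is a small sketch that compares it with an exact division. The helper names are hypothetical and chosen only for this illustration; they are not taken from the patch.

```c
#include <stdint.h>

/* Approximate ns -> us conversion: a shift by 10 divides by 1024. */
static inline uint64_t toy_ns_to_us_shift(uint64_t ns)
{
	return ns >> 10;
}

/* Exact ns -> us conversion, shown only for comparison. */
static inline uint64_t toy_ns_to_us_exact(uint64_t ns)
{
	return ns / 1000;
}
```

For 1,000,000 ns the shift yields 976 where the exact result is 1000, so the shifted value runs roughly 2.3% low (1000/1024 of the true figure). That error is small next to the 1ms and 100ms thresholds mentioned above.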
2017-03-27 10:51:41 -07:00
extern void blk_throtl_bio_endio ( struct bio * bio ) ;
blk-throttle: add a mechanism to estimate IO latency
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
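The bucketing and threshold arithmetic from the commit message above can be sketched as follows. This is a simplified illustration, not the patch's code: it buckets by bytes rather than by sectors, assumes nine buckets (4k up to 1M), treats a base latency of 0 as "no sample yet", and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_BUCKETS 9	/* 4k, 8k, ..., 1M: assumed bucket layout */

/* Round the IO size up to the next power-of-two bucket, capped at 1M. */
static unsigned int toy_size_bucket(uint64_t bytes)
{
	unsigned int bucket = 0;
	uint64_t limit = 4096;

	while (bucket < TOY_NR_BUCKETS - 1 && bytes > limit) {
		limit <<= 1;
		bucket++;
	}
	return bucket;
}

/*
 * Latency threshold = base latency + user-configured latency target.
 * If the bucket has no sample, fall back to the nearest smaller bucket.
 */
static uint64_t toy_latency_threshold(const uint64_t *base_us,
				      unsigned int bucket, uint64_t target_us)
{
	assert(bucket < TOY_NR_BUCKETS);
	while (bucket > 0 && base_us[bucket] == 0)
		bucket--;
	return base_us[bucket] + target_us;
}
```

With a sampled base latency of 80us for the 4k bucket and a 60us target, toy_latency_threshold() returns 140us, matching the example in the text. An 8k request with no sample of its own falls back to the 4k base and gets the same threshold.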
2017-03-27 15:19:42 -07:00
extern void blk_throtl_stat_add ( struct request * rq , u64 time ) ;
2017-03-27 10:51:41 -07:00
# else
static inline void blk_throtl_bio_endio ( struct bio * bio ) { }
2017-03-27 15:19:42 -07:00
static inline void blk_throtl_stat_add ( struct request * rq , u64 time ) { }
2017-03-27 10:51:37 -07:00
# endif
2011-10-19 14:31:18 +02:00
# endif /* BLK_INTERNAL_H */