2008-01-29 16:51:59 +03:00
/*
* Functions related to sysfs handling
*/
# include <linux/kernel.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
# include <linux/slab.h>
2008-01-29 16:51:59 +03:00
# include <linux/module.h>
# include <linux/bio.h>
# include <linux/blkdev.h>
2015-05-23 00:13:32 +03:00
# include <linux/backing-dev.h>
2008-01-29 16:51:59 +03:00
# include <linux/blktrace_api.h>
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
# include <linux/blk-mq.h>
2015-05-23 00:13:17 +03:00
# include <linux/blk-cgroup.h>
2008-01-29 16:51:59 +03:00
# include "blk.h"
2013-12-26 17:31:38 +04:00
# include "blk-mq.h"
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 22:38:14 +03:00
# include "blk-wbt.h"
2008-01-29 16:51:59 +03:00
struct queue_sysfs_entry {
struct attribute attr ;
ssize_t ( * show ) ( struct request_queue * , char * ) ;
ssize_t ( * store ) ( struct request_queue * , const char * , size_t ) ;
} ;
static ssize_t
2009-07-17 11:26:26 +04:00
queue_var_show ( unsigned long var , char * page )
2008-01-29 16:51:59 +03:00
{
2009-07-17 11:26:26 +04:00
	return sprintf(page, "%lu\n", var);
2008-01-29 16:51:59 +03:00
}
static ssize_t
queue_var_store ( unsigned long * var , const char * page , size_t count )
{
2012-09-08 19:55:45 +04:00
int err ;
unsigned long v ;
2013-09-12 01:20:08 +04:00
err = kstrtoul ( page , 10 , & v ) ;
2012-09-08 19:55:45 +04:00
	if (err || v > UINT_MAX)
return - EINVAL ;
* var = v ;
2008-01-29 16:51:59 +03:00
return count ;
}
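The parse-and-range-check pattern in queue_var_store() can be tried out in userspace. The following is only a sketch under stated assumptions: parse_queue_var() is a hypothetical name, strtoul() stands in for the kernel-only kstrtoul(), and the input rules are approximated (a leading sign is rejected outright, one trailing newline is tolerated).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Userspace stand-in for the queue_var_store() pattern: parse a decimal
 * string, reject values above UINT_MAX, and return the number of bytes
 * consumed on success. parse_queue_var() is a hypothetical helper, not a
 * kernel function.
 */
static long parse_queue_var(unsigned long *var, const char *page, size_t count)
{
	char *end;
	unsigned long v;

	assert(var != NULL && page != NULL);

	/* strtoul() would accept whitespace and signs; insist on a digit. */
	if (*page < '0' || *page > '9')
		return -EINVAL;

	errno = 0;
	v = strtoul(page, &end, 10);

	/* Tolerate the single trailing newline that "echo" appends. */
	if (*end == '\n')
		end++;

	if (errno || *end != '\0' || v > UINT_MAX)
		return -EINVAL;

	*var = v;
	return (long)count;
}
```

As in the original, the destination is left untouched when parsing fails, so a caller can keep its previous value on error.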
2016-11-28 19:22:47 +03:00
static ssize_t queue_var_store64 ( s64 * var , const char * page )
2016-11-09 22:38:14 +03:00
{
int err ;
2016-11-28 19:22:47 +03:00
s64 v ;
2016-11-09 22:38:14 +03:00
2016-11-28 19:22:47 +03:00
err = kstrtos64 ( page , 10 , & v ) ;
2016-11-09 22:38:14 +03:00
if ( err < 0 )
return err ;
* var = v ;
return 0 ;
}
2008-01-29 16:51:59 +03:00
static ssize_t queue_requests_show ( struct request_queue * q , char * page )
{
	return queue_var_show(q->nr_requests, (page));
}
static ssize_t
queue_requests_store ( struct request_queue * q , const char * page , size_t count )
{
unsigned long nr ;
2014-05-20 21:49:02 +04:00
int ret , err ;
2009-09-12 00:44:29 +04:00
2014-05-20 21:49:02 +04:00
	if (!q->request_fn && !q->mq_ops)
2009-09-12 00:44:29 +04:00
return - EINVAL ;
ret = queue_var_store ( & nr , page , count ) ;
2012-09-08 19:55:45 +04:00
if ( ret < 0 )
return ret ;
2008-01-29 16:51:59 +03:00
if ( nr < BLKDEV_MIN_RQ )
nr = BLKDEV_MIN_RQ ;
2014-05-20 21:49:02 +04:00
	if (q->request_fn)
err = blk_update_nr_requests ( q , nr ) ;
else
err = blk_mq_update_nr_requests ( q , nr ) ;
if ( err )
return err ;
2008-01-29 16:51:59 +03:00
return ret ;
}
static ssize_t queue_ra_show ( struct request_queue * q , char * page )
{
2009-07-17 11:26:26 +04:00
	unsigned long ra_kb = q->backing_dev_info.ra_pages <<
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
( PAGE_SHIFT - 10 ) ;
2008-01-29 16:51:59 +03:00
return queue_var_show ( ra_kb , ( page ) ) ;
}
static ssize_t
queue_ra_store ( struct request_queue * q , const char * page , size_t count )
{
unsigned long ra_kb ;
ssize_t ret = queue_var_store ( & ra_kb , page , count ) ;
2012-09-08 19:55:45 +04:00
if ( ret < 0 )
return ret ;
2016-04-01 15:29:47 +03:00
	q->backing_dev_info.ra_pages = ra_kb >> (PAGE_SHIFT - 10);
2008-01-29 16:51:59 +03:00
return ret ;
}
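The two read-ahead handlers convert between pages and KiB by shifting by (PAGE_SHIFT - 10). A minimal sketch of that arithmetic follows; the 4 KiB page size (EXAMPLE_PAGE_SHIFT of 12) and the helper names are assumptions made for illustration, since the real PAGE_SHIFT depends on the architecture.

```c
#include <assert.h>
#include <limits.h>

/* Assumption for illustration only: 4 KiB pages, i.e. a page shift of 12. */
#define EXAMPLE_PAGE_SHIFT 12
#define EXAMPLE_KB_SHIFT (EXAMPLE_PAGE_SHIFT - 10)

/* Pages to KiB: each page holds 2^(PAGE_SHIFT - 10) KiB. */
static unsigned long pages_to_kb(unsigned long pages)
{
	/* Guard against the left shift overflowing unsigned long. */
	assert(pages <= (ULONG_MAX >> EXAMPLE_KB_SHIFT));
	return pages << EXAMPLE_KB_SHIFT;
}

/* KiB to pages: the right shift rounds down to whole pages. */
static unsigned long kb_to_pages(unsigned long kb)
{
	return kb >> EXAMPLE_KB_SHIFT;
}
```

Because the store path rounds down, a value written through it can read back smaller than what was written: with 4 KiB pages, 130 KiB is stored as 32 pages and shown as 128 KiB.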
static ssize_t queue_max_sectors_show ( struct request_queue * q , char * page )
{
2009-05-23 01:17:50 +04:00
	int max_sectors_kb = queue_max_sectors(q) >> 1;
2008-01-29 16:51:59 +03:00
return queue_var_show ( max_sectors_kb , ( page ) ) ;
}
2010-03-10 08:48:33 +03:00
static ssize_t queue_max_segments_show ( struct request_queue * q , char * page )
{
return queue_var_show ( queue_max_segments ( q ) , ( page ) ) ;
}
2010-09-10 22:50:10 +04:00
static ssize_t queue_max_integrity_segments_show ( struct request_queue * q , char * page )
{
	return queue_var_show(q->limits.max_integrity_segments, (page));
}
2010-03-10 08:48:33 +03:00
static ssize_t queue_max_segment_size_show ( struct request_queue * q , char * page )
{
2010-12-01 21:41:49 +03:00
if ( blk_queue_cluster ( q ) )
2010-03-10 08:48:33 +03:00
return queue_var_show ( queue_max_segment_size ( q ) , ( page ) ) ;
2016-04-01 15:29:47 +03:00
return queue_var_show ( PAGE_SIZE , ( page ) ) ;
2010-03-10 08:48:33 +03:00
}
2009-05-23 01:17:49 +04:00
static ssize_t queue_logical_block_size_show ( struct request_queue * q , char * page )
2008-01-29 21:14:08 +03:00
{
2009-05-23 01:17:49 +04:00
return queue_var_show ( queue_logical_block_size ( q ) , page ) ;
2008-01-29 21:14:08 +03:00
}
2009-05-23 01:17:53 +04:00
static ssize_t queue_physical_block_size_show ( struct request_queue * q , char * page )
{
return queue_var_show ( queue_physical_block_size ( q ) , page ) ;
}
2016-10-18 09:40:30 +03:00
static ssize_t queue_chunk_sectors_show ( struct request_queue * q , char * page )
{
	return queue_var_show(q->limits.chunk_sectors, page);
}
2009-05-23 01:17:53 +04:00
static ssize_t queue_io_min_show ( struct request_queue * q , char * page )
{
return queue_var_show ( queue_io_min ( q ) , page ) ;
}
static ssize_t queue_io_opt_show ( struct request_queue * q , char * page )
{
return queue_var_show ( queue_io_opt ( q ) , page ) ;
2008-01-29 21:14:08 +03:00
}
2009-11-10 13:50:21 +03:00
static ssize_t queue_discard_granularity_show ( struct request_queue * q , char * page )
{
	return queue_var_show(q->limits.discard_granularity, page);
}
2015-07-16 18:14:26 +03:00
static ssize_t queue_discard_max_hw_show ( struct request_queue * q , char * page )
{
2016-02-17 17:15:30 +03:00
	return sprintf(page, "%llu\n",
		(unsigned long long)q->limits.max_hw_discard_sectors << 9);
2015-07-16 18:14:26 +03:00
}
2009-11-10 13:50:21 +03:00
static ssize_t queue_discard_max_show ( struct request_queue * q , char * page )
{
2011-05-18 12:37:35 +04:00
	return sprintf(page, "%llu\n",
		(unsigned long long)q->limits.max_discard_sectors << 9);
2009-11-10 13:50:21 +03:00
}
2015-07-16 18:14:26 +03:00
static ssize_t queue_discard_max_store ( struct request_queue * q ,
const char * page , size_t count )
{
unsigned long max_discard ;
ssize_t ret = queue_var_store ( & max_discard , page , count ) ;
if ( ret < 0 )
return ret ;
	if (max_discard & (q->limits.discard_granularity - 1))
		return -EINVAL;
	max_discard >>= 9;
	if (max_discard > UINT_MAX)
		return -EINVAL;
	if (max_discard > q->limits.max_hw_discard_sectors)
		max_discard = q->limits.max_hw_discard_sectors;
	q->limits.max_discard_sectors = max_discard;
return ret ;
}
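The checks in queue_discard_max_store() run in a fixed order: alignment to the discard granularity, conversion from bytes to 512-byte sectors, then clamping to the hardware limit. The sketch below mirrors that order in userspace. struct example_limits and set_discard_max() are hypothetical stand-ins for the queue limits and the store handler, and the mask test assumes the granularity is a power of two.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical subset of the queue limits used by the discard handler. */
struct example_limits {
	unsigned int discard_granularity;	/* bytes, assumed power of two */
	unsigned int max_hw_discard_sectors;	/* hardware ceiling, in sectors */
	unsigned int max_discard_sectors;	/* current soft limit, in sectors */
};

static int set_discard_max(struct example_limits *lim, unsigned long bytes)
{
	assert(lim != NULL);

	/* Reject byte counts that are not a multiple of the granularity. */
	if (bytes & (lim->discard_granularity - 1))
		return -EINVAL;

	/* Convert bytes to 512-byte sectors. */
	bytes >>= 9;
	if (bytes > UINT_MAX)
		return -EINVAL;

	/* Silently clamp to what the hardware supports. */
	if (bytes > lim->max_hw_discard_sectors)
		bytes = lim->max_hw_discard_sectors;

	lim->max_discard_sectors = (unsigned int)bytes;
	return 0;
}
```

Note the asymmetry: a misaligned request is an error, while an oversized one succeeds and is reduced to the hardware maximum.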
2009-12-03 11:24:48 +03:00
static ssize_t queue_discard_zeroes_data_show ( struct request_queue * q , char * page )
{
return queue_var_show ( queue_discard_zeroes_data ( q ) , page ) ;
}
2012-09-18 20:19:27 +04:00
static ssize_t queue_write_same_max_show ( struct request_queue * q , char * page )
{
	return sprintf(page, "%llu\n",
		(unsigned long long)q->limits.max_write_same_sectors << 9);
}
2016-11-30 23:28:59 +03:00
static ssize_t queue_write_zeroes_max_show ( struct request_queue * q , char * page )
{
	return sprintf(page, "%llu\n",
		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
}
2012-09-18 20:19:27 +04:00
2008-01-29 16:51:59 +03:00
static ssize_t
queue_max_sectors_store ( struct request_queue * q , const char * page , size_t count )
{
unsigned long max_sectors_kb ,
2009-05-23 01:17:50 +04:00
		max_hw_sectors_kb = queue_max_hw_sectors(q) >> 1,
2016-04-01 15:29:47 +03:00
		page_kb = 1 << (PAGE_SHIFT - 10);
2008-01-29 16:51:59 +03:00
ssize_t ret = queue_var_store ( & max_sectors_kb , page , count ) ;
2012-09-08 19:55:45 +04:00
if ( ret < 0 )
return ret ;
2015-11-14 00:46:48 +03:00
	max_hw_sectors_kb = min_not_zero(max_hw_sectors_kb, (unsigned long)
					 q->limits.max_dev_sectors >> 1);
2008-01-29 16:51:59 +03:00
	if (max_sectors_kb > max_hw_sectors_kb || max_sectors_kb < page_kb)
return - EINVAL ;
2008-11-25 11:08:39 +03:00
2008-01-29 16:51:59 +03:00
	spin_lock_irq(q->queue_lock);
2009-09-02 00:40:15 +04:00
q - > limits . max_sectors = max_sectors_kb < < 1 ;
mm: don't cap request size based on read-ahead setting
We ran into a funky issue, where someone doing 256K buffered reads saw
128K requests at the device level. Turns out it is read-ahead capping
the request size, since we use 128K as the default setting. This
doesn't make a lot of sense - if someone is issuing 256K reads, they
should see 256K reads, regardless of the read-ahead setting, if the
underlying device can support a 256K read in a single command.
This patch introduces a bdi hint, io_pages. This is the soft max IO
size for the lower level, I've hooked it up to the bdev settings here.
Read-ahead is modified to issue the maximum of the user request size,
and the read-ahead max size, but capped to the max request size on the
device side. The latter is done to avoid reading ahead too much, if the
application asks for a huge read. With this patch, the kernel behaves
like the application expects.
Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 03:43:26 +03:00
	q->backing_dev_info.io_pages = max_sectors_kb >> (PAGE_SHIFT - 10);
2008-01-29 16:51:59 +03:00
	spin_unlock_irq(q->queue_lock);
	return ret;
}
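The unit arithmetic in queue_max_sectors_store() can be sketched in user space as follows. This is only an illustration: the `sketch_` names are hypothetical, and `SKETCH_PAGE_SHIFT` assumes 4 KiB pages, whereas the kernel's PAGE_SHIFT is architecture-dependent.

```c
/* Assumption: 4 KiB pages (PAGE_SHIFT == 12). Not taken from the kernel headers. */
#define SKETCH_PAGE_SHIFT 12

/* Size of one page in KiB: 1 << (12 - 10) == 4. */
static unsigned long sketch_page_kb(void)
{
	return 1UL << (SKETCH_PAGE_SHIFT - 10);
}

/* KiB -> 512-byte sectors: two sectors per KiB. */
static unsigned long sketch_kb_to_sectors(unsigned long kb)
{
	return kb << 1;
}

/* KiB -> whole pages (rounds down), as used for the io_pages hint. */
static unsigned long sketch_kb_to_pages(unsigned long kb)
{
	return kb >> (SKETCH_PAGE_SHIFT - 10);
}

/*
 * Mirrors the range check in the store function: the new value must not
 * exceed the hardware limit and must cover at least one page.
 */
static int sketch_max_sectors_kb_valid(unsigned long kb, unsigned long hw_kb)
{
	return kb <= hw_kb && kb >= sketch_page_kb();
}
```

Under the 4 KiB assumption, writing 128 to max_sectors_kb would correspond to 256 sectors and an io_pages hint of 32.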
static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page)
{
2009-05-23 01:17:50 +04:00
	int max_hw_sectors_kb = queue_max_hw_sectors(q) >> 1;
2008-01-29 16:51:59 +03:00
	return queue_var_show(max_hw_sectors_kb, (page));
}
2010-08-07 20:13:50 +04:00
#define QUEUE_SYSFS_BIT_FNS(name, flag, neg) \
static ssize_t \
queue_show_##name(struct request_queue *q, char *page) \
{ \
	int bit; \
	bit = test_bit(QUEUE_FLAG_##flag, &q->queue_flags); \
	return queue_var_show(neg ? !bit : bit, page); \
} \
static ssize_t \
queue_store_##name(struct request_queue *q, const char *page, size_t count) \
{ \
	unsigned long val; \
	ssize_t ret; \
	ret = queue_var_store(&val, page, count); \
2013-04-03 23:53:57 +04:00
	if (ret < 0) \
		return ret; \
2010-08-07 20:13:50 +04:00
	if (neg) \
		val = !val; \
 \
	spin_lock_irq(q->queue_lock); \
	if (val) \
		queue_flag_set(QUEUE_FLAG_##flag, q); \
	else \
		queue_flag_clear(QUEUE_FLAG_##flag, q); \
	spin_unlock_irq(q->queue_lock); \
	return ret; \
2009-01-07 14:22:39 +03:00
}
2010-08-07 20:13:50 +04:00
QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1);
QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
#undef QUEUE_SYSFS_BIT_FNS
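A stripped-down user-space sketch of the token-pasting pattern used by QUEUE_SYSFS_BIT_FNS may help show how the `neg` argument inverts the exposed value (so that the NONROT flag can be presented as "rotational"). Everything prefixed with `sketch_` or `SKETCH_` is hypothetical; locking and sysfs buffer parsing are deliberately left out.

```c
/* Hypothetical stand-in for the queue; only the flag word matters here. */
struct sketch_queue {
	unsigned long flags;
};

/* Hypothetical bit positions, analogous to QUEUE_FLAG_*. */
enum { SKETCH_FLAG_NONROT = 0, SKETCH_FLAG_IO_STAT = 1 };

/*
 * Generate a show/store pair per flag. "neg" inverts the value on both
 * the read and the write path, so the two stay consistent.
 */
#define SKETCH_BIT_FNS(name, flag, neg) \
static int sketch_show_##name(const struct sketch_queue *q) \
{ \
	int bit = (q->flags >> SKETCH_FLAG_##flag) & 1; \
	return neg ? !bit : bit; \
} \
static void sketch_store_##name(struct sketch_queue *q, unsigned long val) \
{ \
	if (neg) \
		val = !val; \
	if (val) \
		q->flags |= 1UL << SKETCH_FLAG_##flag; \
	else \
		q->flags &= ~(1UL << SKETCH_FLAG_##flag); \
}

SKETCH_BIT_FNS(nonrot, NONROT, 1)
SKETCH_BIT_FNS(iostats, IO_STAT, 0)
```

Storing 1 through the negated accessor clears the underlying bit, yet reading it back still yields 1, which is the behaviour the kernel macro relies on for the "rotational" attribute.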
2009-01-07 14:22:39 +03:00
2016-10-18 09:40:29 +03:00
static ssize_t queue_zoned_show(struct request_queue *q, char *page)
{
	switch (blk_queue_zoned_model(q)) {
	case BLK_ZONED_HA:
		return sprintf(page, "host-aware\n");
	case BLK_ZONED_HM:
		return sprintf(page, "host-managed\n");
	default:
		return sprintf(page, "none\n");
	}
}
2008-04-29 16:44:19 +04:00
static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
{
2010-01-29 11:04:08 +03:00
	return queue_var_show((blk_queue_nomerges(q) << 1) |
			       blk_queue_noxmerges(q), page);
2008-04-29 16:44:19 +04:00
}
static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
				    size_t count)
{
	unsigned long nm;
	ssize_t ret = queue_var_store(&nm, page, count);
2012-09-08 19:55:45 +04:00
	if (ret < 0)
		return ret;
2008-05-07 11:09:39 +04:00
	spin_lock_irq(q->queue_lock);
2010-01-29 11:04:08 +03:00
	queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
	queue_flag_clear(QUEUE_FLAG_NOXMERGES, q);
	if (nm == 2)
2008-05-07 11:09:39 +04:00
		queue_flag_set(QUEUE_FLAG_NOMERGES, q);
2010-01-29 11:04:08 +03:00
	else if (nm)
		queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
2008-05-07 11:09:39 +04:00
	spin_unlock_irq(q->queue_lock);
2009-01-07 14:22:39 +03:00
2008-04-29 16:44:19 +04:00
	return ret;
}
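The way the "nomerges" attribute packs two flags into the values 0, 1 and 2 can be illustrated with a small user-space model. The struct and function names are hypothetical; only the encoding logic follows the show/store pair above.

```c
/* Hypothetical model of the two merge-related queue flags. */
struct sketch_merge_flags {
	int nomerges;	/* models QUEUE_FLAG_NOMERGES */
	int noxmerges;	/* models QUEUE_FLAG_NOXMERGES */
};

/* 2 sets NOMERGES, any other non-zero value sets NOXMERGES, 0 clears both. */
static void sketch_nomerges_store(struct sketch_merge_flags *f, unsigned long nm)
{
	f->nomerges = 0;
	f->noxmerges = 0;
	if (nm == 2)
		f->nomerges = 1;
	else if (nm)
		f->noxmerges = 1;
}

/* NOMERGES lands in bit 1, NOXMERGES in bit 0. */
static int sketch_nomerges_show(const struct sketch_merge_flags *f)
{
	return (f->nomerges << 1) | f->noxmerges;
}
```

One consequence visible in the sketch: writing a value such as 5 reads back as 1, because every non-zero value other than 2 takes the NOXMERGES branch.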
2008-09-13 22:26:01 +04:00
static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
{
2009-07-17 11:26:26 +04:00
	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
2011-07-23 22:44:25 +04:00
	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
2008-09-13 22:26:01 +04:00
2011-07-23 22:44:25 +04:00
	return queue_var_show(set << force, page);
2008-09-13 22:26:01 +04:00
}
static ssize_t
queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
{
	ssize_t ret = -EINVAL;
2013-11-15 02:32:07 +04:00
#ifdef CONFIG_SMP
2008-09-13 22:26:01 +04:00
	unsigned long val;
	ret = queue_var_store(&val, page, count);
2012-09-08 19:55:45 +04:00
	if (ret < 0)
		return ret;
2008-09-13 22:26:01 +04:00
	spin_lock_irq(q->queue_lock);
2011-08-23 23:25:12 +04:00
	if (val == 2) {
2008-09-13 22:26:01 +04:00
		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
2011-08-23 23:25:12 +04:00
		queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
	} else if (val == 1) {
		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
	} else if (val == 0) {
2011-07-23 22:44:25 +04:00
		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
	}
2008-09-13 22:26:01 +04:00
	spin_unlock_irq(q->queue_lock);
#endif
	return ret;
}
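The `set << force` expression in queue_rq_affinity_show() is compact enough to deserve a worked example. The following user-space model (hypothetical names, no locking, no CONFIG_SMP guard) shows that the shift maps the flag pairs back to the values 0, 1 and 2 accepted by the store function.

```c
/* Hypothetical model of the two completion-affinity flags. */
struct sketch_affinity {
	int same_comp;	/* models QUEUE_FLAG_SAME_COMP */
	int same_force;	/* models QUEUE_FLAG_SAME_FORCE */
};

/* Values other than 0, 1 and 2 leave the flags untouched, as in the original. */
static void sketch_rq_affinity_store(struct sketch_affinity *a, unsigned long val)
{
	if (val == 2) {
		a->same_comp = 1;
		a->same_force = 1;
	} else if (val == 1) {
		a->same_comp = 1;
		a->same_force = 0;
	} else if (val == 0) {
		a->same_comp = 0;
		a->same_force = 0;
	}
}

/* 1 << 1 == 2 when both are set, 1 << 0 == 1 for SAME_COMP only, else 0. */
static int sketch_rq_affinity_show(const struct sketch_affinity *a)
{
	return a->same_comp << a->same_force;
}
```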
2008-01-29 16:51:59 +03:00
2016-11-14 23:01:59 +03:00
static ssize_t queue_poll_delay_show(struct request_queue *q, char *page)
{
2016-11-14 23:03:03 +03:00
	int val;
	if (q->poll_nsec == -1)
		val = -1;
	else
		val = q->poll_nsec / 1000;
	return sprintf(page, "%d\n", val);
2016-11-14 23:01:59 +03:00
}
static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
				      size_t count)
{
2016-11-14 23:03:03 +03:00
	int err, val;
2016-11-14 23:01:59 +03:00
	if (!q->mq_ops || !q->mq_ops->poll)
		return -EINVAL;
2016-11-14 23:03:03 +03:00
	err = kstrtoint(page, 10, &val);
	if (err < 0)
		return err;
2016-11-14 23:01:59 +03:00
2016-11-14 23:03:03 +03:00
	if (val == -1)
		q->poll_nsec = -1;
	else
		q->poll_nsec = val * 1000;
	return count;
2016-11-14 23:01:59 +03:00
}
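The io_poll_delay attribute is exposed in microseconds but stored in nanoseconds, with -1 passed through unchanged as a sentinel. A minimal sketch of that round trip, using hypothetical helper names and plain `int` as in the functions above:

```c
/* Stored nanoseconds -> microseconds shown to user space; -1 is kept as-is. */
static int sketch_poll_delay_show(int poll_nsec)
{
	if (poll_nsec == -1)
		return -1;
	return poll_nsec / 1000;
}

/* Microseconds written by user space -> stored nanoseconds; -1 is kept as-is. */
static int sketch_poll_delay_store(int val_usec)
{
	if (val_usec == -1)
		return -1;
	return val_usec * 1000;
}
```

Because the multiplication is done in `int`, very large microsecond values would overflow; the sketch inherits that property from the code it models rather than guarding against it.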
2015-11-05 20:44:55 +03:00
static ssize_t queue_poll_show(struct request_queue *q, char *page)
{
	return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
}
static ssize_t queue_poll_store(struct request_queue *q, const char *page,
				size_t count)
{
	unsigned long poll_on;
	ssize_t ret;
	if (!q->mq_ops || !q->mq_ops->poll)
		return -EINVAL;
	ret = queue_var_store(&poll_on, page, count);
	if (ret < 0)
		return ret;
	spin_lock_irq(q->queue_lock);
	if (poll_on)
		queue_flag_set(QUEUE_FLAG_POLL, q);
	else
		queue_flag_clear(QUEUE_FLAG_POLL, q);
	spin_unlock_irq(q->queue_lock);
	return ret;
}
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wbt_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 22:38:14 +03:00
static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
{
	if (!q->rq_wb)
		return -EINVAL;
	return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
}
static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
				  size_t count)
{
2016-11-28 19:22:47 +03:00
	struct rq_wb *rwb;
2016-11-09 22:38:14 +03:00
	ssize_t ret;
2016-11-28 19:22:47 +03:00
	s64 val;
2016-11-09 22:38:14 +03:00
	ret = queue_var_store64(&val, page);
	if (ret < 0)
		return ret;
2016-11-28 19:40:34 +03:00
	if (val < -1)
		return -EINVAL;
	rwb = q->rq_wb;
	if (!rwb) {
		ret = wbt_init(q);
		if (ret)
			return ret;
		rwb = q->rq_wb;
		if (!rwb)
			return -EINVAL;
	}
2016-11-09 22:38:14 +03:00
2016-11-28 19:22:47 +03:00
	if (val == -1)
		rwb->min_lat_nsec = wbt_default_latency_nsec(q);
	else if (val >= 0)
		rwb->min_lat_nsec = val * 1000ULL;
2016-11-28 19:40:34 +03:00
	if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
		rwb->enable_state = WBT_STATE_ON_MANUAL;
2016-11-28 19:22:47 +03:00
	wbt_update_limits(rwb);
2016-11-09 22:38:14 +03:00
	return count;
}
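The value handling in queue_wb_lat_store() can be summarised with a small user-space sketch. The function name and signature are hypothetical, and the two default latencies are an assumption taken from the commit message (2 msec for non-rotational, 75 msec for rotational storage) rather than from the body of wbt_default_latency_nsec(), which is not shown here.

```c
#include <stdint.h>

/* Assumed defaults, taken from the commit message text. */
#define SKETCH_DEFAULT_NONROT_NSEC 2000000ULL	/* 2 msec */
#define SKETCH_DEFAULT_ROT_NSEC 75000000ULL	/* 75 msec */

/*
 * Translate a user-supplied latency in microseconds into nanoseconds.
 * Returns 0 on success, -1 if the value is below -1 (the kernel code
 * returns -EINVAL in that case).
 */
static int sketch_wb_lat_store(int64_t val_usec, int rotational,
			       uint64_t *min_lat_nsec)
{
	if (val_usec < -1)
		return -1;
	if (val_usec == -1)
		*min_lat_nsec = rotational ? SKETCH_DEFAULT_ROT_NSEC
					   : SKETCH_DEFAULT_NONROT_NSEC;
	else
		*min_lat_nsec = (uint64_t)val_usec * 1000ULL;
	return 0;
}
```

In this model a value of 0 simply stores a zero latency target, which matches the commit message's statement that writing '0' disables throttling.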
2016-04-12 21:32:46 +03:00
static ssize_t queue_wc_show(struct request_queue *q, char *page)
{
	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
		return sprintf(page, "write back\n");
	return sprintf(page, "write through\n");
}
static ssize_t queue_wc_store(struct request_queue *q, const char *page,
			      size_t count)
{
	int set = -1;
	if (!strncmp(page, "write back", 10))
		set = 1;
	else if (!strncmp(page, "write through", 13) ||
		 !strncmp(page, "none", 4))
		set = 0;
	if (set == -1)
		return -EINVAL;
	spin_lock_irq(q->queue_lock);
	if (set)
		queue_flag_set(QUEUE_FLAG_WC, q);
	else
		queue_flag_clear(QUEUE_FLAG_WC, q);
	spin_unlock_irq(q->queue_lock);
	return count;
}
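The string matching in queue_wc_store() is prefix-based, which is easy to overlook. A user-space sketch of just the parsing step (hypothetical function name; the return values 1, 0 and -1 correspond to the `set` variable above):

```c
#include <string.h>

/*
 * Returns 1 for "write back", 0 for "write through" or "none",
 * and -1 for anything else. Only the leading bytes are compared,
 * so a trailing newline or other trailing text is ignored.
 */
static int sketch_parse_write_cache(const char *page)
{
	if (!strncmp(page, "write back", 10))
		return 1;
	if (!strncmp(page, "write through", 13) ||
	    !strncmp(page, "none", 4))
		return 0;
	return -1;
}
```

Because of the prefix comparison, input such as "write backwards" is accepted as "write back", while "writeback" (without the space) is rejected.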
2016-06-24 00:05:51 +03:00
static ssize_t queue_dax_show(struct request_queue *q, char *page)
{
	return queue_var_show(blk_queue_dax(q), page);
}
2016-11-08 07:32:37 +03:00
static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
{
	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
			pre, (long long) stat->nr_samples,
			(long long) stat->mean, (long long) stat->min,
			(long long) stat->max);
}
static ssize_t queue_stats_show(struct request_queue *q, char *page)
{
	struct blk_rq_stat stat[2];
	ssize_t ret;
	blk_queue_stat_get(q, stat);
	ret = print_stat(page, &stat[BLK_STAT_READ], "read :");
	ret += print_stat(page + ret, &stat[BLK_STAT_WRITE], "write:");
	return ret;
}
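queue_stats_show() builds its output by calling sprintf() twice and using the first return value as the offset for the second call. A self-contained sketch of that idiom follows; the struct and function names are hypothetical, and all fields are printed with `%lld` for simplicity.

```c
#include <stdio.h>

/* Hypothetical stand-in for struct blk_rq_stat. */
struct sketch_stat {
	long long nr_samples;
	long long mean;
	long long min;
	long long max;
};

/* Formats one line into "page" and returns the number of bytes written. */
static int sketch_print_stat(char *page, const struct sketch_stat *s,
			     const char *pre)
{
	return sprintf(page, "%s samples=%lld, mean=%lld, min=%lld, max=%lld\n",
		       pre, s->nr_samples, s->mean, s->min, s->max);
}

/* Appends the write line directly after the read line. */
static int sketch_stats_show(char *page, const struct sketch_stat stat[2])
{
	int ret;

	ret = sketch_print_stat(page, &stat[0], "read :");
	ret += sketch_print_stat(page + ret, &stat[1], "write:");
	return ret;
}
```

The caller has to supply a buffer large enough for both lines; in sysfs the show callback is handed a page-sized buffer, whereas the sketch leaves sizing to the caller.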
2008-01-29 16:51:59 +03:00
static struct queue_sysfs_entry queue_requests_entry = {
	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
	.show = queue_requests_show,
	.store = queue_requests_store,
};
static struct queue_sysfs_entry queue_ra_entry = {
	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
	.show = queue_ra_show,
	.store = queue_ra_store,
};
static struct queue_sysfs_entry queue_max_sectors_entry = {
	.attr = {.name = "max_sectors_kb", .mode = S_IRUGO | S_IWUSR },
	.show = queue_max_sectors_show,
	.store = queue_max_sectors_store,
};
static struct queue_sysfs_entry queue_max_hw_sectors_entry = {
	.attr = {.name = "max_hw_sectors_kb", .mode = S_IRUGO },
	.show = queue_max_hw_sectors_show,
};
2010-03-10 08:48:33 +03:00
static struct queue_sysfs_entry queue_max_segments_entry = {
	.attr = {.name = "max_segments", .mode = S_IRUGO },
	.show = queue_max_segments_show,
};
2010-09-10 22:50:10 +04:00
static struct queue_sysfs_entry queue_max_integrity_segments_entry = {
	.attr = {.name = "max_integrity_segments", .mode = S_IRUGO },
	.show = queue_max_integrity_segments_show,
};
2010-03-10 08:48:33 +03:00
static struct queue_sysfs_entry queue_max_segment_size_entry = {
	.attr = {.name = "max_segment_size", .mode = S_IRUGO },
	.show = queue_max_segment_size_show,
};
2008-01-29 16:51:59 +03:00
static struct queue_sysfs_entry queue_iosched_entry = {
	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
	.show = elv_iosched_show,
	.store = elv_iosched_store,
};
2008-01-29 21:14:08 +03:00
static struct queue_sysfs_entry queue_hw_sector_size_entry = {
	.attr = {.name = "hw_sector_size", .mode = S_IRUGO },
2009-05-23 01:17:49 +04:00
	.show = queue_logical_block_size_show,
};
static struct queue_sysfs_entry queue_logical_block_size_entry = {
	.attr = {.name = "logical_block_size", .mode = S_IRUGO },
	.show = queue_logical_block_size_show,
2008-01-29 21:14:08 +03:00
};
2009-05-23 01:17:53 +04:00
static struct queue_sysfs_entry queue_physical_block_size_entry = {
	.attr = {.name = "physical_block_size", .mode = S_IRUGO },
	.show = queue_physical_block_size_show,
};
2016-10-18 09:40:30 +03:00
static struct queue_sysfs_entry queue_chunk_sectors_entry = {
	.attr = {.name = "chunk_sectors", .mode = S_IRUGO },
	.show = queue_chunk_sectors_show,
};
2009-05-23 01:17:53 +04:00
static struct queue_sysfs_entry queue_io_min_entry = {
	.attr = {.name = "minimum_io_size", .mode = S_IRUGO },
	.show = queue_io_min_show,
};
static struct queue_sysfs_entry queue_io_opt_entry = {
	.attr = {.name = "optimal_io_size", .mode = S_IRUGO },
	.show = queue_io_opt_show,
2008-01-29 21:14:08 +03:00
};
2009-11-10 13:50:21 +03:00
static struct queue_sysfs_entry queue_discard_granularity_entry = {
	.attr = {.name = "discard_granularity", .mode = S_IRUGO },
	.show = queue_discard_granularity_show,
};
2015-07-16 18:14:26 +03:00
static struct queue_sysfs_entry queue_discard_max_hw_entry = {
	.attr = {.name = "discard_max_hw_bytes", .mode = S_IRUGO },
	.show = queue_discard_max_hw_show,
};
2009-11-10 13:50:21 +03:00
static struct queue_sysfs_entry queue_discard_max_entry = {
2015-07-16 18:14:26 +03:00
	.attr = {.name = "discard_max_bytes", .mode = S_IRUGO | S_IWUSR },
2009-11-10 13:50:21 +03:00
	.show = queue_discard_max_show,
2015-07-16 18:14:26 +03:00
	.store = queue_discard_max_store,
2009-11-10 13:50:21 +03:00
};
2009-12-03 11:24:48 +03:00
static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
	.attr = {.name = "discard_zeroes_data", .mode = S_IRUGO },
	.show = queue_discard_zeroes_data_show,
};
2012-09-18 20:19:27 +04:00
static struct queue_sysfs_entry queue_write_same_max_entry = {
	.attr = {.name = "write_same_max_bytes", .mode = S_IRUGO },
	.show = queue_write_same_max_show,
};
2016-11-30 23:28:59 +03:00
static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
	.attr = {.name = "write_zeroes_max_bytes", .mode = S_IRUGO },
	.show = queue_write_zeroes_max_show,
};
2009-01-07 14:22:39 +03:00
static struct queue_sysfs_entry queue_nonrot_entry = {
	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
2010-08-07 20:13:50 +04:00
	.show = queue_show_nonrot,
	.store = queue_store_nonrot,
2009-01-07 14:22:39 +03:00
};
2016-10-18 09:40:29 +03:00
static struct queue_sysfs_entry queue_zoned_entry = {
	.attr = {.name = "zoned", .mode = S_IRUGO },
	.show = queue_zoned_show,
};
2008-04-29 16:44:19 +04:00
static struct queue_sysfs_entry queue_nomerges_entry = {
	.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
	.show = queue_nomerges_show,
	.store = queue_nomerges_store,
};
2008-09-13 22:26:01 +04:00
static struct queue_sysfs_entry queue_rq_affinity_entry = {
	.attr = {.name = "rq_affinity", .mode = S_IRUGO | S_IWUSR },
	.show = queue_rq_affinity_show,
	.store = queue_rq_affinity_store,
};
2009-01-23 12:54:44 +03:00
static struct queue_sysfs_entry queue_iostats_entry = {
	.attr = {.name = "iostats", .mode = S_IRUGO | S_IWUSR },
2010-08-07 20:13:50 +04:00
	.show = queue_show_iostats,
	.store = queue_store_iostats,
2009-01-23 12:54:44 +03:00
};
2010-06-09 12:42:09 +04:00
static struct queue_sysfs_entry queue_random_entry = {
	.attr = {.name = "add_random", .mode = S_IRUGO | S_IWUSR },
2010-08-07 20:13:50 +04:00
	.show = queue_show_random,
	.store = queue_store_random,
2010-06-09 12:42:09 +04:00
};
2015-11-05 20:44:55 +03:00
static struct queue_sysfs_entry queue_poll_entry = {
	.attr = {.name = "io_poll", .mode = S_IRUGO | S_IWUSR },
	.show = queue_poll_show,
	.store = queue_poll_store,
};
2016-11-14 23:01:59 +03:00
static struct queue_sysfs_entry queue_poll_delay_entry = {
	.attr = {.name = "io_poll_delay", .mode = S_IRUGO | S_IWUSR },
	.show = queue_poll_delay_show,
	.store = queue_poll_delay_store,
};
2016-04-12 21:32:46 +03:00
static struct queue_sysfs_entry queue_wc_entry = {
	.attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
	.show = queue_wc_show,
	.store = queue_wc_store,
};
2016-06-24 00:05:51 +03:00
static struct queue_sysfs_entry queue_dax_entry = {
	.attr = {.name = "dax", .mode = S_IRUGO },
	.show = queue_dax_show,
};
2016-11-08 07:32:37 +03:00
static struct queue_sysfs_entry queue_stats_entry = {
	.attr = {.name = "stats", .mode = S_IRUGO },
	.show = queue_stats_show,
};
2016-11-09 22:38:14 +03:00
static struct queue_sysfs_entry queue_wb_lat_entry = {
	.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
	.show = queue_wb_lat_show,
	.store = queue_wb_lat_store,
};
2008-01-29 16:51:59 +03:00
static struct attribute *default_attrs[] = {
	&queue_requests_entry.attr,
	&queue_ra_entry.attr,
	&queue_max_hw_sectors_entry.attr,
	&queue_max_sectors_entry.attr,
2010-03-10 08:48:33 +03:00
	&queue_max_segments_entry.attr,
2010-09-10 22:50:10 +04:00
	&queue_max_integrity_segments_entry.attr,
2010-03-10 08:48:33 +03:00
	&queue_max_segment_size_entry.attr,
2008-01-29 16:51:59 +03:00
	&queue_iosched_entry.attr,
2008-01-29 21:14:08 +03:00
	&queue_hw_sector_size_entry.attr,
2009-05-23 01:17:49 +04:00
	&queue_logical_block_size_entry.attr,
2009-05-23 01:17:53 +04:00
	&queue_physical_block_size_entry.attr,
2016-10-18 09:40:30 +03:00
	&queue_chunk_sectors_entry.attr,
2009-05-23 01:17:53 +04:00
	&queue_io_min_entry.attr,
	&queue_io_opt_entry.attr,
2009-11-10 13:50:21 +03:00
	&queue_discard_granularity_entry.attr,
	&queue_discard_max_entry.attr,
2015-07-16 18:14:26 +03:00
	&queue_discard_max_hw_entry.attr,
2009-12-03 11:24:48 +03:00
	&queue_discard_zeroes_data_entry.attr,
2012-09-18 20:19:27 +04:00
	&queue_write_same_max_entry.attr,
2016-11-30 23:28:59 +03:00
	&queue_write_zeroes_max_entry.attr,
2009-01-07 14:22:39 +03:00
	&queue_nonrot_entry.attr,
2016-10-18 09:40:29 +03:00
	&queue_zoned_entry.attr,
2008-04-29 16:44:19 +04:00
	&queue_nomerges_entry.attr,
2008-09-13 22:26:01 +04:00
	&queue_rq_affinity_entry.attr,
2009-01-23 12:54:44 +03:00
	&queue_iostats_entry.attr,
2010-06-09 12:42:09 +04:00
	&queue_random_entry.attr,
2015-11-05 20:44:55 +03:00
	&queue_poll_entry.attr,
2016-04-12 21:32:46 +03:00
	&queue_wc_entry.attr,
2016-06-24 00:05:51 +03:00
	&queue_dax_entry.attr,
2016-11-08 07:32:37 +03:00
	&queue_stats_entry.attr,
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 22:38:14 +03:00
& queue_wb_lat_entry . attr ,
2016-11-14 23:01:59 +03:00
& queue_poll_delay_entry . attr ,
2008-01-29 16:51:59 +03:00
NULL ,
} ;
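The attribute table above ends with a NULL entry, so code that walks it needs no separate length. A minimal user-space sketch of that sentinel pattern follows; the demo_* types and the attribute names "alpha" and "beta" are hypothetical stand-ins, not the kernel's struct attribute or real queue attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct attribute. */
struct demo_attr {
	const char *name;
};

static struct demo_attr demo_alpha = { .name = "alpha" };
static struct demo_attr demo_beta = { .name = "beta" };

/* The table ends with NULL, so no element count has to be kept in sync. */
static struct demo_attr *demo_attrs[] = {
	&demo_alpha,
	&demo_beta,
	NULL,
};

/* Count entries by walking until the NULL sentinel. */
static size_t demo_attr_count(struct demo_attr *const *attrs)
{
	size_t n = 0;

	assert(attrs != NULL);
	while (attrs[n] != NULL)
		n++;
	return n;
}

/* Look up an attribute by name; returns NULL when it is not in the table. */
static struct demo_attr *demo_attr_find(struct demo_attr *const *attrs,
					const char *name)
{
	size_t i;

	assert(attrs != NULL && name != NULL);
	for (i = 0; attrs[i] != NULL; i++)
		if (strcmp(attrs[i]->name, name) == 0)
			return attrs[i];
	return NULL;
}
```

Adding an attribute then only means adding one line before the NULL terminator, which matches how the commits recorded above each appended a single entry.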
# define to_queue(atr) container_of((atr), struct queue_sysfs_entry, attr)
static ssize_t
queue_attr_show ( struct kobject * kobj , struct attribute * attr , char * page )
{
struct queue_sysfs_entry * entry = to_queue ( attr ) ;
struct request_queue * q =
container_of ( kobj , struct request_queue , kobj ) ;
ssize_t res ;
if ( ! entry - > show )
return - EIO ;
mutex_lock ( & q - > sysfs_lock ) ;
2012-11-28 16:42:38 +04:00
if ( blk_queue_dying ( q ) ) {
2008-01-29 16:51:59 +03:00
mutex_unlock ( & q - > sysfs_lock ) ;
return - ENOENT ;
}
res = entry - > show ( q , page ) ;
mutex_unlock ( & q - > sysfs_lock ) ;
return res ;
}
static ssize_t
queue_attr_store ( struct kobject * kobj , struct attribute * attr ,
const char * page , size_t length )
{
struct queue_sysfs_entry * entry = to_queue ( attr ) ;
2008-01-31 15:03:55 +03:00
struct request_queue * q ;
2008-01-29 16:51:59 +03:00
ssize_t res ;
if ( ! entry - > store )
return - EIO ;
2008-01-31 15:03:55 +03:00
q = container_of ( kobj , struct request_queue , kobj ) ;
2008-01-29 16:51:59 +03:00
mutex_lock ( & q - > sysfs_lock ) ;
2012-11-28 16:42:38 +04:00
if ( blk_queue_dying ( q ) ) {
2008-01-29 16:51:59 +03:00
mutex_unlock ( & q - > sysfs_lock ) ;
return - ENOENT ;
}
res = entry - > store ( q , page , length ) ;
mutex_unlock ( & q - > sysfs_lock ) ;
return res ;
}
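The show and store helpers above share one shape: recover the enclosing entry from the embedded attribute, reject a missing callback with -EIO, reject a dying queue with -ENOENT, then dispatch. The user-space sketch below illustrates only the show path under simplifying assumptions: the demo_* names are hypothetical, the container_of macro is re-implemented with offsetof, and the sysfs_lock mutex taken by the real code is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* User-space re-implementation of the container_of idea. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_attr {
	const char *name;
};

/* Hypothetical, much reduced stand-in for struct request_queue. */
struct demo_queue {
	unsigned long nr_requests;
	int dying;
};

/* An entry embeds the generic attribute and adds a typed callback. */
struct demo_entry {
	struct demo_attr attr;
	long (*show)(struct demo_queue *q, char *page, size_t len);
};

static long demo_requests_show(struct demo_queue *q, char *page, size_t len)
{
	return snprintf(page, len, "%lu\n", q->nr_requests);
}

static struct demo_entry demo_requests_entry = {
	.attr = { .name = "nr_requests" },
	.show = demo_requests_show,
};

/* An entry without a show callback, to exercise the -EIO path. */
static struct demo_entry demo_writeonly_entry = {
	.attr = { .name = "write_only" },
};

/* Generic dispatcher: only sees the embedded attribute pointer. */
static long demo_attr_show(struct demo_queue *q, struct demo_attr *attr,
			   char *page, size_t len)
{
	struct demo_entry *entry;

	assert(q != NULL && attr != NULL);
	entry = demo_container_of(attr, struct demo_entry, attr);
	if (!entry->show)
		return -EIO;
	if (q->dying)
		return -ENOENT;
	return entry->show(q, page, len);
}
```

The dispatcher never needs to know which attribute it was handed; the pointer arithmetic in demo_container_of gets it from the generic member back to the typed entry.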
2013-01-09 20:05:13 +04:00
static void blk_free_queue_rcu ( struct rcu_head * rcu_head )
{
struct request_queue * q = container_of ( rcu_head , struct request_queue ,
rcu_head ) ;
kmem_cache_free ( blk_requestq_cachep , q ) ;
}
2008-01-29 16:51:59 +03:00
/**
2011-09-21 12:01:22 +04:00
* blk_release_queue : - release a & struct request_queue when it is no longer needed
* @ kobj : the kobj belonging to the request queue to be released
2008-01-29 16:51:59 +03:00
*
* Description :
2011-09-21 12:01:22 +04:00
* blk_release_queue is the pair to blk_init_queue ( ) or
2008-01-29 16:51:59 +03:00
* blk_queue_make_request ( ) . It should be called when a request queue is
* being released ; typically when a block device is being de - registered .
* Currently , its primary task is to free all the & struct request
* structures that were allocated to the queue and the queue itself .
*
2014-12-09 18:57:48 +03:00
* Note :
* The low level driver must have finished any outstanding requests first
* via blk_cleanup_queue ( ) .
2008-01-29 16:51:59 +03:00
* */
static void blk_release_queue ( struct kobject * kobj )
{
struct request_queue * q =
container_of ( kobj , struct request_queue , kobj ) ;
2016-11-09 22:38:14 +03:00
wbt_exit ( q ) ;
block: don't release bdi while request_queue has live references
bdi's are initialized in two steps, bdi_init() and bdi_register(), but
destroyed in a single step by bdi_destroy() which, for a bdi embedded
in a request_queue, is called during blk_cleanup_queue() which makes
the queue invisible and starts the draining of remaining usages.
A request_queue's user can access the congestion state of the embedded
bdi as long as it holds a reference to the queue. As such, it may
access the congested state of a queue which finished
blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
Because the congested state was embedded in backing_dev_info which in
turn is embedded in request_queue, accessing the congested state after
bdi_destroy() was called was fine. The bdi was destroyed but the
memory region for the congested state remained accessible till the
queue got released.
a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
bdi_writeback") changed the situation. Now, the root congested state
which is expected to be pinned while request_queue remains accessible
is separately reference counted and the base ref is put during
bdi_destroy(). This means that the root congested state may go away
prematurely while the queue is between bdi_destroy() and
blk_cleanup_queue(), which was detected by Andrey's KASAN tests.
The root cause of this problem is that bdi doesn't distinguish the two
steps of destruction, unregistration and release, and now the root
congested state actually requires a separate release step. To fix the
issue, this patch separates out bdi_unregister() and bdi_exit() from
bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
simple wrapper calling the two steps back-to-back.
While at it, the prototype of bdi_destroy() is moved right below
bdi_setup_and_register() so that the counterpart operations are
located together.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Cc: stable@vger.kernel.org # v4.2+
Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com>
Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-08 19:20:22 +03:00
bdi_exit ( & q - > backing_dev_info ) ;
2012-03-06 01:15:20 +04:00
blkcg_exit_queue ( q ) ;
2011-12-14 03:33:42 +04:00
if ( q - > elevator ) {
spin_lock_irq ( q - > queue_lock ) ;
ioc_clear_queue ( q ) ;
spin_unlock_irq ( q - > queue_lock ) ;
2011-09-28 18:07:01 +04:00
elevator_exit ( q - > elevator ) ;
2011-12-14 03:33:42 +04:00
}
2011-09-28 18:07:01 +04:00
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 02:05:44 +04:00
blk_exit_rl ( & q - > root_rl ) ;
2008-01-29 16:51:59 +03:00
if ( q - > queue_tags )
__blk_queue_free_tags ( q ) ;
2014-12-09 18:57:48 +03:00
if ( ! q - > mq_ops )
2014-09-25 19:23:47 +04:00
blk_free_flush_queue ( q - > fq ) ;
2015-01-29 15:17:27 +03:00
else
blk_mq_release ( q ) ;
2014-02-10 20:29:00 +04:00
2008-01-29 16:51:59 +03:00
blk_trace_shutdown ( q ) ;
2015-04-24 08:37:18 +03:00
if ( q - > bio_split )
bioset_free ( q - > bio_split ) ;
2011-12-14 03:33:37 +04:00
ida_simple_remove ( & blk_queue_ida , q - > id ) ;
2013-01-09 20:05:13 +04:00
call_rcu ( & q - > rcu_head , blk_free_queue_rcu ) ;
2008-01-29 16:51:59 +03:00
}
2010-01-19 04:58:23 +03:00
static const struct sysfs_ops queue_sysfs_ops = {
2008-01-29 16:51:59 +03:00
. show = queue_attr_show ,
. store = queue_attr_store ,
} ;
struct kobj_type blk_queue_ktype = {
. sysfs_ops = & queue_sysfs_ops ,
. default_attrs = default_attrs ,
. release = blk_release_queue ,
} ;
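blk_queue_ktype above ties a release callback to the queue's kobject, so the queue is torn down only when the last reference is dropped. The sketch below shows that idea in plain user-space C. Everything named demo_* is hypothetical, and the plain int counter is an assumption made for brevity; it is not atomic and is not how the kernel counts references.

```c
#include <assert.h>
#include <stddef.h>

struct demo_kobj;

/* Per-type operations, reduced to the release hook. */
struct demo_ktype {
	void (*release)(struct demo_kobj *kobj);
};

struct demo_kobj {
	int refcount;			/* not atomic: single-threaded sketch only */
	const struct demo_ktype *ktype;
};

static void demo_kobj_init(struct demo_kobj *kobj, const struct demo_ktype *ktype)
{
	kobj->refcount = 1;		/* the creator holds the first reference */
	kobj->ktype = ktype;
}

static void demo_kobj_get(struct demo_kobj *kobj)
{
	assert(kobj->refcount > 0);
	kobj->refcount++;
}

/* Drop one reference; run the type's release hook when it was the last. */
static void demo_kobj_put(struct demo_kobj *kobj)
{
	assert(kobj->refcount > 0);
	if (--kobj->refcount == 0 && kobj->ktype->release)
		kobj->ktype->release(kobj);
}

/* Counts how often the release hook ran, so callers can observe it. */
static int demo_release_calls;

static void demo_queue_release(struct demo_kobj *kobj)
{
	(void)kobj;
	demo_release_calls++;
}

static const struct demo_ktype demo_queue_ktype = {
	.release = demo_queue_release,
};
```

A holder that took an extra reference keeps the object alive past the owner's put, which is why cleanup that other users may still depend on belongs in the release hook rather than in the unregister path.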
2016-11-09 22:38:14 +03:00
static void blk_wb_init ( struct request_queue * q )
{
# ifndef CONFIG_BLK_WBT_MQ
if ( q - > mq_ops )
return ;
# endif
# ifndef CONFIG_BLK_WBT_SQ
if ( q - > request_fn )
return ;
# endif
/*
* If this fails , we don ' t get throttling
*/
2016-11-11 07:50:51 +03:00
wbt_init ( q ) ;
2016-11-09 22:38:14 +03:00
}
2008-01-29 16:51:59 +03:00
int blk_register_queue ( struct gendisk * disk )
{
int ret ;
2009-04-14 10:00:05 +04:00
struct device * dev = disk_to_dev ( disk ) ;
2008-01-29 16:51:59 +03:00
struct request_queue * q = disk - > queue ;
2008-04-21 11:51:06 +04:00
if ( WARN_ON ( ! q ) )
2008-01-29 16:51:59 +03:00
return - ENXIO ;
2012-09-21 01:08:52 +04:00
/*
2014-09-24 21:31:50 +04:00
* SCSI probing may synchronously create and destroy a lot of
* request_queues for non - existent devices . Shutting down a fully
* functional queue takes measurable wallclock time as RCU grace
* periods are involved . To avoid excessive latency in these
* cases , a request_queue starts out in a degraded mode which is
* faster to shut down and is made fully functional here as
* request_queues for non - existent devices never get registered .
2012-09-21 01:08:52 +04:00
*/
2014-09-09 19:50:58 +04:00
if ( ! blk_queue_init_done ( q ) ) {
queue_flag_set_unlocked ( QUEUE_FLAG_INIT_DONE , q ) ;
2015-10-21 20:20:12 +03:00
percpu_ref_switch_to_percpu ( & q - > q_usage_counter ) ;
2014-09-09 19:50:58 +04:00
blk_queue_bypass_end ( q ) ;
}
2012-09-21 01:08:52 +04:00
2009-04-14 10:00:05 +04:00
ret = blk_trace_init_sysfs ( dev ) ;
if ( ret )
return ret ;
2009-06-11 21:52:27 +04:00
ret = kobject_add ( & q - > kobj , kobject_get ( & dev - > kobj ) , " %s " , " queue " ) ;
2011-04-19 15:47:58 +04:00
if ( ret < 0 ) {
blk_trace_remove_sysfs ( dev ) ;
2008-01-29 16:51:59 +03:00
return ret ;
2011-04-19 15:47:58 +04:00
}
2008-01-29 16:51:59 +03:00
kobject_uevent ( & q - > kobj , KOBJ_ADD ) ;
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
if ( q - > mq_ops )
2016-09-16 15:25:06 +03:00
blk_mq_register_dev ( dev , q ) ;
2013-10-24 12:20:05 +04:00
2016-11-09 22:38:14 +03:00
blk_wb_init ( q ) ;
2009-05-23 01:17:52 +04:00
if ( ! q - > request_fn )
return 0 ;
2008-01-29 16:51:59 +03:00
ret = elv_register_queue ( q ) ;
if ( ret ) {
kobject_uevent ( & q - > kobj , KOBJ_REMOVE ) ;
kobject_del ( & q - > kobj ) ;
2011-04-14 00:14:54 +04:00
blk_trace_remove_sysfs ( dev ) ;
2010-08-23 14:30:29 +04:00
kobject_put ( & dev - > kobj ) ;
2008-01-29 16:51:59 +03:00
return ret ;
}
return 0 ;
}
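blk_register_queue above sets things up in a fixed order and, when a later step fails, undoes the earlier ones before returning the error. The user-space sketch below models that unwind with three hypothetical steps. It uses a goto ladder rather than the inline cleanup of the original, which is a restructuring chosen for the sketch, and the demo_* names and the fail_step injection knob are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Tracks which setup steps are currently in effect. */
struct demo_state {
	bool trace_sysfs;
	bool kobj_added;
	bool elevator;
	int fail_step;		/* 1..3 forces that step to fail, 0 for none */
};

/* Perform one setup step, or fail it when fault injection asks for that. */
static int demo_step(struct demo_state *s, int step, bool *flag)
{
	if (s->fail_step == step)
		return -ENOMEM;
	*flag = true;
	return 0;
}

/* Set up in order; on failure undo the completed steps in reverse order. */
static int demo_register(struct demo_state *s)
{
	int ret;

	assert(s != NULL);
	ret = demo_step(s, 1, &s->trace_sysfs);
	if (ret)
		return ret;	/* nothing to undo yet */

	ret = demo_step(s, 2, &s->kobj_added);
	if (ret)
		goto undo_trace;

	ret = demo_step(s, 3, &s->elevator);
	if (ret)
		goto undo_kobj;

	return 0;

undo_kobj:
	s->kobj_added = false;
undo_trace:
	s->trace_sysfs = false;
	return ret;
}
```

Whichever step fails, the caller sees either a fully registered state or none of it, so it never has to guess which partial teardown is needed.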
void blk_unregister_queue ( struct gendisk * disk )
{
struct request_queue * q = disk - > queue ;
2008-04-21 11:51:06 +04:00
if ( WARN_ON ( ! q ) )
return ;
2013-10-24 12:20:05 +04:00
if ( q - > mq_ops )
2016-09-16 15:25:06 +03:00
blk_mq_unregister_dev ( disk_to_dev ( disk ) , q ) ;
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
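
The funnelling described above can be illustrated with a minimal user-space
sketch. It is not the mapping code from this commit: the helper name
map_cpu_to_hw_queue and the modulo policy are assumptions chosen only to
show how per-cpu queues could collapse onto a smaller number of hardware
submission queues (N:M), and how the mapping becomes 1:1 when there are at
least as many hardware queues as CPUs.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a blk-mq function): pick the hardware
 * submission queue that the software queue of a given CPU feeds into.
 * A plain modulo spreads CPUs evenly; a real implementation would
 * likely take CPU topology into account.
 */
static unsigned int map_cpu_to_hw_queue(unsigned int cpu,
					unsigned int nr_hw_queues)
{
	assert(nr_hw_queues > 0);	/* a device needs at least one queue */
	return cpu % nr_hw_queues;
}
```

With 4 hardware queues, CPUs 1 and 5 share hardware queue 1; with 8 hardware
queues on a 4-CPU machine, every CPU gets a queue of its own.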
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking requests on a per-device
basis. Basically the driver should be able to get a notification
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
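
The per-request payload idea from the list above can be sketched in user
space as follows. All names here (sketch_request, sketch_alloc_request,
sketch_request_to_cmd) are hypothetical and not part of blk-mq; the sketch
only assumes that the driver states its command size once and that the
private command area is allocated together with the request, directly
behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a request; the payload follows the header. */
struct sketch_request {
	unsigned int tag;
	max_align_t payload[];	/* driver private command area */
};

/* Allocate a zeroed request with cmd_size bytes of driver payload. */
static struct sketch_request *sketch_alloc_request(size_t cmd_size)
{
	return calloc(1, sizeof(struct sketch_request) + cmd_size);
}

/* Return the driver private command area attached to a request. */
static void *sketch_request_to_cmd(struct sketch_request *rq)
{
	assert(rq != NULL);
	return rq->payload;
}
```

Because the request and its payload come from a single allocation, the
driver needs no separate lookup or allocation on the IO path to find its
command structure.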
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 12:20:05 +04:00
2009-09-25 08:19:26 +04:00
if (q->request_fn)
2008-01-29 16:51:59 +03:00
	elv_unregister_queue(q);
2009-09-25 08:19:26 +04:00
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
kobject_put(&disk_to_dev(disk)->kobj);
2008-01-29 16:51:59 +03:00
}