IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This patchset ("stable page writes, part 2") makes some key
modifications to the original 'stable page writes' patchset. First, it
provides creators (devices and filesystems) of a backing_dev_info a flag
that declares whether or not it is necessary to ensure that page
contents cannot change during writeout. It is no longer assumed that
this is true of all devices (which was never true anyway). Second, the
flag is used to relaxed the wait_on_page_writeback calls so that wait
only occurs if the device needs it. Third, it fixes up the remaining
disk-backed filesystems to use this improved conditional-wait logic to
provide stable page writes on those filesystems.
It is hoped that (for people not using checksumming devices, anyway)
this patchset will give back unnecessary performance decreases since the
original stable page write patchset went into 3.0. Sorry about not
fixing it sooner.
Complaints were registered by several people about the long write
latencies introduced by the original stable page write patchset.
Generally speaking, the kernel ought to allocate as little extra memory
as possible to facilitate writeout, but for people who simply cannot
wait, a second page stability strategy is (re)introduced: snapshotting
page contents. The waiting behavior is still the default strategy; to
enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
set. This flag is used to bandaid^Henable stable page writeback on
ext3[1], and is not used anywhere else.
Given that there are already a few storage devices and network FSes that
have rolled their own page stability wait/page snapshot code, it would
be nice to move towards consolidating all of these. It seems possible
that iscsi and raid5 may wish to use the new stable page write support
to enable zero-copy writeout.
Thank you to Jan Kara for helping fix a couple more filesystems.
Per Andrew Morton's request, here are the result of using dbench to measure
latencies on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, for ext2 the maximum write latency decreases from ~60ms
on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
increase, though I suspect that being able to dirty pages faster gives
the flusher more work to do.
On ext4, the average write latency decreases as well as all the maximum
latencies:
3.8.0-rc3:
WriteX 85624 0.152 33.078
ReadX 272090 0.010 61.210
Flush 12129 36.219 168.260
Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms
3.8.0-rc3 + patches:
WriteX 86082 0.141 30.928
ReadX 273358 0.010 36.124
Flush 12214 34.800 165.689
Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms
XFS seems to exhibit similar latency improvements as ext2:
3.8.0-rc3:
WriteX 125739 0.028 104.343
ReadX 399070 0.005 4.115
Flush 17851 25.004 131.390
Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms
3.8.0-rc3 + patches:
WriteX 123529 0.028 6.299
ReadX 392434 0.005 4.287
Flush 17549 25.120 188.687
Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms
...and btrfs, just to round things out, also shows some latency
decreases:
3.8.0-rc3:
WriteX 67122 0.083 82.355
ReadX 212719 0.005 2.828
Flush 9547 47.561 147.418
Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms
3.8.0-rc3 + patches:
WriteX 64898 0.101 71.631
ReadX 206673 0.005 7.123
Flush 9190 47.963 219.034
Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own wait code, or they don't block at all. The blocking
behavior is back to what it was before 3.0 if you don't have a disk
requiring stable page writes.
This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
results as -rc3.
[1] The alternative fixes to ext3 include fixing the locking order and
page bit handling like we did for ext4 (but then why not just use
ext4?), or setting PG_writeback so early that ext3 becomes extremely
slow. I tried that, but the number of write()s I could initiate dropped
by nearly an order of magnitude. That was a bit much even for the
author of the stable page series! :)
This patch:
Creates a per-backing-device flag that tracks whether or not pages must
be held immutable during writeout. Eventually it will be used to waive
wait_for_page_writeback() if nothing requires stable pages.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull async changes from Tejun Heo:
"These are followups for the earlier deadlock issue involving async
ending up waiting for itself through block requesting module[1]. The
following changes are made by these commits.
- Instead of requesting default elevator on each request_queue init,
block now requests it once early during boot.
- Kmod triggers warning if invoked from an async worker.
- Async synchronization implementation has been reimplemented. It's
a lot simpler now."
* 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
async: initialise list heads to fix crash
async: replace list of active domains with global list of pending items
async: keep pending tasks on async_domain and remove async_pending
async: use ULLONG_MAX for infinity cookie value
async: bring sanity to the use of words domain and running
async, kmod: warn on synchronous request_module() from async workers
block: don't request module during elevator init
init, block: try to load default elevator module early during boot
Pull scheduler changes from Ingo Molnar:
"Main changes:
- scheduler side full-dynticks (user-space execution is undisturbed
and receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready, from Frederic
Weisbecker.
- Initial sched.h split-up changes, by Clark Williams
- select_idle_sibling() performance improvement by Mike Galbraith:
" 1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs "
- sched_rr_get_interval() ABI fix/change. We think this detail is not
used by apps (so it's not an ABI in practice), but lets keep it
under observation.
- misc RT scheduling cleanups, optimizations"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
cputime: Remove irqsave from seqlock readers
sched, powerpc: Fix sched.h split-up build failure
cputime: Restore CPU_ACCOUNTING config defaults for PPC64
sched/rt: Move rt specific bits into new header file
sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
sched: Move sched.h sysctl bits into separate header
sched: Fix signedness bug in yield_to()
sched: Fix select_idle_sibling() bouncing cow syndrome
sched/rt: Further simplify pick_rt_task()
sched/rt: Do not account zero delta_exec in update_curr_rt()
cputime: Safely read cputime of full dynticks CPUs
kvm: Prepare to add generic guest entry/exit callbacks
cputime: Use accessors to read task cputime stats
cputime: Allow dynamic switch between tick/virtual based cputime accounting
cputime: Generic on-demand virtual cputime accounting
cputime: Move default nsecs_to_cputime() to jiffies based cputime file
cputime: Librarize per nsecs resolution cputime definitions
cputime: Avoid multiplication overflow on utime scaling
context_tracking: Export context state for generic vtime
...
Fix up conflict in kernel/context_tracking.c due to comment additions.
Using wait_for_completion() for waiting for a IO request to be executed
results in wrong iowait time accounting. For example, a system having
the only task doing write() and fdatasync() on a block device can be
reported being idle instead of iowaiting as it should because
blkdev_issue_flush() calls wait_for_completion() which in turn calls
schedule() that does not increment the iowait proc counter and thus does
not turn on iowait time accounting.
The patch makes block layer use wait_for_completion_io() instead of
wait_for_completion() where appropriate to account iowait time
correctly.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the sysctl-related bits from include/linux/sched.h into
a new file: include/linux/sched/sysctl.h. Then update source
files requiring access to those bits by including the new
header file.
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Block layer allows selecting an elevator which is built as a module to
be selected as system default via kernel param "elevator=". This is
achieved by automatically invoking request_module() whenever a new
block device is initialized and the elevator is not available.
This led to an interesting deadlock problem involving async and module
init. Block device probing running off an async job invokes
request_module(). While the module is being loaded, it performs
async_synchronize_full() which ends up waiting for the async job which
is already waiting for request_module() to finish, leading to
deadlock.
Invoking request_module() from deep in block device init path is
already nasty in itself. It seems best to avoid these situations from
the beginning by moving on-demand module loading out of block init
path.
The previous patch made sure that the default elevator module is
loaded early during boot if available. This patch removes on-demand
loading of the default elevator from elevator init path. As the
module would have been loaded during boot, userland-visible behavior
difference should be minimal.
For more details, please refer to the following thread.
http://thread.gmane.org/gmane.linux.kernel/1420814
v2: The bool parameter was named @request_module which conflicted with
request_module(). This built okay w/ CONFIG_MODULES because
request_module() was defined as a macro. W/o CONFIG_MODULES, it
causes build breakage. Rename the parameter to @try_loading.
Reported by Fengguang.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
This patch adds default module loading and uses it to load the default
block elevator. During boot, it's called right after initramfs or
initrd is made available and right before control is passed to
userland. This ensures that as long as the modules are available in
the usual places in initramfs, initrd or the root filesystem, the
default modules are loaded as soon as possible.
This will replace the on-demand elevator module loading from elevator
init path.
v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang We <fengguang.wu@intel.com>
bio_{front|back}_merge tracepoints report a bio merging into an
existing request but didn't specify which request the bio is being
merged into. Add @req to it. This makes it impossible to share the
event template with block_bio_queue - split it out.
@req isn't used or exported to userland at this point and there is no
userland visible behavior change. Later changes will make use of the
extra parameter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio completion didn't kick block_bio_complete TP. Only dm was
explicitly triggering the TP on IO completion. This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.
This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.
* Explicit trace_block_bio_complete() invocation removed from dm and
the trace point is unexported.
* @rq dropped from trace_block_bio_complete(). bios may fly around
w/o queue associated. Verifying and accessing the assocaited queue
belongs to TP probes.
* blktrace now gets both request and bio completions. Make it ignore
bio completions if request completion path is happening.
This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.
v2: With this change, block_bio_complete TP could be invoked on sg
commands which have bio's with %NULL bi_bdev. Update TP
assignment code to check whether bio->bi_bdev is %NULL before
dereferencing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun writes:
Hello, Jens.
Please consider pulling from the following branch to receive cfq blkcg
hierarchy support. The branch is based on top of v3.8-rc2.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy
The patchset was reviewd in the following thread.
http://thread.gmane.org/gmane.linux.kernel.cgroups/5571
In commit 975927b942c932,it add blk_rq_pos to sort rq when flushing.
Although this commit was used for the situation which blk_plug handled
multi devices on the same time like md device.
I think there must be some situations like this but only single
device.
So remove the should_sort judgement.
Because the parameter should_sort is only for this purpose,it can delete
should_sort from blk_plug.
CC: Shaohua Li <shli@kernel.org>
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch elevator to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the elevator.
This also removes the dymanic allocation of the hash table. The size of the table is
constant so there's no point in paying the price of an extra dereference when accessing
it.
This patch depends on d9b482c ("hashtable: introduce a small and naive
hashtable") which was merged in v3.6.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unfortunately, at this point, there's no way to make the existing
statistics hierarchical without creating nasty surprises for the
existing users. Just create recursive counterpart of the existing
stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
To support hierarchical stats, it's necessary to remember stats from
dead children. Add cfqg->dead_stats and make a dying cfqg transfer
its stats to the parent's dead-stats.
The transfer happens form ->pd_offline_fn() and it is possible that
there are some residual IOs completing afterwards. Currently, we lose
these stats. Given that cgroup removal isn't a very high frequency
operation and the amount of residual IOs on offline are likely to be
nil or small, this shouldn't be a big deal and the complexity needed
to handle residual IOs - another callback and rather elaborate
synchronization to reach and lock the matching q - doesn't seem
justified.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
cfq_pd_reset_stats() and move the latter to where other pd methods are
defined. cfqg_stats_reset() will be used to implement hierarchical
stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Instead of holding blkcg->lock while walking ->blkg_list and executing
prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
executing prfill(). This makes prfill() implementations easier as
stats are mostly protected by queue lock.
This will be used to implement hierarchical stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
RCU free request_queue so that blkcg_gq->q can be dereferenced under
RCU lock. This will be used to implement hierarchical stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
The former two collect the [rw]stats designated by the target policy
data and offset from the pd's subtree. The latter two add one
[rw]stat to another.
Note that the recursive sum functions require the queue lock to be
held on entry to make blkg online test reliable. This is necessary to
properly handle stats of a dying blkg.
These will be used to implement hierarchical stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
Export it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for
summing up stats from multiple blkgs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
which are invoked as the policy_data gets activated and deactivated
while holding both blkcg and q locks.
Also, add blkcg_gq->online bool, which is set and cleared as the
blkcg_gq gets activated and deactivated. This flag also is toggled
while holding both blkcg and q locks.
These will be used to implement hierarchical stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Add pd->plid so that the policy a pd belongs to can be identified
easily. This will be used to implement hierarchical blkg_[rw]stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support. The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes vfraction
calculation consider all non-root cfqgs children of the root cfqg.
Replace it with cfqg_parent() which returns the real parent. This
enables full blkcg hierarchy support for cfq-iosched. For example,
consider the following hierarchy.
root
/ \
A:500 B:250
/ \
AA:500 AB:1000
For simplicity, let's say all the leaf nodes have active tasks and are
on service tree. For each leaf node, vfraction would be
AA: (500 / 1500) * (500 / 750) =~ 0.2222
AB: (1000 / 1500) * (500 / 750) =~ 0.4444
B: (250 / 750) =~ 0.3333
and vdisktime will be distributed accordingly. For more detail,
please refer to Documentation/block/cfq-iosched.txt.
v2: cfq-iosched.txt updated to describe group scheduling as suggested
by Vivek.
v3: blkio-controller.txt updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
cfq_group_slice() calculates slice by taking a fraction of
cfq_target_latency according to the ratio of cfqg->weight against
service_tree->total_weight. This currently works only because all
cfqgs are treated to be at the same level.
To prepare for proper hierarchy support, convert cfq_group_slice() to
base the calculation on cfqg->vfraction. As cfqg->vfraction is always
a fraction of 1 and represents the fraction allocated to the cfqg with
hierarchy considered, the slice can be simply calculated by
multiplying cfqg->vfraction to cfq_target_latency (with fixed point
shift factored in).
As vfraction calculation currently treats all non-root cfqgs as
children of the root cfqg, this patch doesn't introduce noticeable
behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Currently, cfqg charges are scaled directly according to cfqg->weight.
Regardless of the number of active cfqgs or the amount of active
weights, a given weight value always scales charge the same way. This
works fine as long as all cfqgs are treated equally regardless of
their positions in the hierarchy, which is what cfq currently
implements. It can't work in hierarchical settings because the
interpretation of a given weight value depends on where the weight is
located in the hierarchy.
This patch reimplements cfqg charge scaling so that it can be used to
support hierarchy properly. The scheme is fairly simple and
light-weight.
* When a cfqg is added to the service tree, v(disktime)weight is
calculated. It walks up the tree to root calculating the fraction
it has in the hierarchy. At each level, the fraction can be
calculated as
cfqg->weight / parent->level_weight
By compounding these, the global fraction of vdisktime the cfqg has
claim to - vfraction - can be determined.
* When the cfqg needs to be charged, the charge is scaled inversely
proportionally to the vfraction.
The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
representation as before; however, the smallest scaling factor is now
1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1
was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
scaling factor.
While this shifts the global scale of vdisktime a bit, it doesn't
change the relative relationships among cfqgs and the scheduling
result isn't different.
cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY
didn't have any relevance to vdisktime before and is unlikely to cause
any visible behavior difference now especially as the scale shift
isn't that large.
As the new scheme now makes proper distinction between cfqg->weight
and ->leaf_weight, reverse the weight aliasing for root cfqgs. For
root, both weights are now mapped to ->leaf_weight instead of the
other way around.
Because we're still using cfqg_flat_parent(), this patch shouldn't
change the scheduling behavior in any noticeable way.
v2: Beefed up comments on vfraction as requested by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
To prepare for blkcg hierarchy support, add cfqg->nr_active and
->children_weight. cfqg->nr_active counts the number of active cfqgs
at the cfqg's level and ->children_weight is sum of weights of those
cfqgs. The level covers itself (cfqg->leaf_weight) and immediate
children.
The two values are updated when a cfqg enters and leaves the group
service tree. Unless the hierarchy is very deep, the added overhead
should be negligible.
Currently, the parent is determined using cfqg_flat_parent() which
makes the root cfqg the parent of all other cfqgs. This is to make
the transition to hierarchy-aware scheduling gradual. Scheduling
logic will be converted to use cfqg->children_weight without actually
changing the behavior. When everything is ready,
blkcg_weight_parent() will be replaced with proper parent function.
This patch doesn't introduce any behavior chagne.
v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
cfq blkcg is about to grow proper hierarchy handling, where a child
blkg's weight would nest inside the parent's. This makes tasks in a
blkg to compete against both tasks in the sibling blkgs and the tasks
of child blkgs.
We're gonna use the existing weight as the group weight which decides
the blkg's weight against its siblings. This patch introduces a new
weight - leaf_weight - which decides the weight of a blkg against the
child blkgs.
It's named leaf_weight because another way to look at it is that each
internal blkg nodes have a hidden child leaf node which contains all
its tasks and leaf_weight is the weight of the leaf node and handled
the same as the weight of the child blkgs.
This patch only adds leaf_weight fields and exposes it to userland.
The new weight isn't actually used anywhere yet. Note that
cfq-iosched currently offcially supports only single level hierarchy
and root blkgs compete with the first level blkgs - ie. root weight is
basically being used as leaf_weight. For root blkgs, the two weights
are kept in sync for backward compatibility.
v2: cfqd->root_group->leaf_weight initialization was missing from
cfq_init_queue() causing divide by zero when
!CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Currently a child blkg (blkcg_gq) can be created even if its parent
doesn't exist. ie. Given a blkg, it's not guaranteed that its
ancestors will exist. This makes it difficult to implement proper
hierarchy support for blkcg policies.
Always create blkgs recursively and make a child blkg hold a reference
to its parent. blkg->parent is added so that finding the parent is
easy. blkcg_parent() is also added in the process.
This change can be visible to userland. e.g. while issuing IO in a
nested cgroup didn't affect the ancestors at all, now it will
initialize all ancestor blkgs and zero stats for the request_queue
will always appear on them. While this is userland visible, this
shouldn't cause any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
* Rename out_* labels to err_*.
* Do ERR_PTR() conversion once in the error return path.
This patch is cosmetic and to prepare for the hierarchy support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reorganize such that
* __blkg_lookup() takes bool param @update_hint to determine whether
to update hint.
* __blkg_lookup_create() no longer performs lookup before trying to
create. Renamed to blkg_create().
* blkg_lookup_create() now performs lookup and then invokes
blkg_create() if lookup fails.
* root_blkg creation in blkcg_activate_policy() updated accordingly.
Note that blkcg_activate_policy() no longer updates lookup hint if
root_blkg already exists.
Except for the last lookup hint bit which is immaterial, this is pure
reorganization and doesn't introduce any visible behavior change.
This is to prepare for proper hierarchy support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
The latter test should have been on whether pol->pd_init_fn() exists.
This doesn't cause actual problems because both blkcg policies
implement pol->pd_init_fn(). Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Currently we attach a character "S" or "A" to the cfqq<pid>, to represent
whether queues is sync or async. Add one more character "N" to represent
whether it is sync-noidle queue or sync queue. So now three different
type of queues will look as follows.
cfq1234S --> sync queus
cfq1234SN --> sync noidle queue
cfq1234A --> Async queue
Previously S/A classification was being printed only if group scheduling
was enabled. This patch also makes sure that this classification is
displayed even if group idling is disabled.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use of local varibale "n" seems to be unnecessary. Remove it. This brings
it inline with function __cfq_group_st_add(), which is also doing the
similar operation of adding a group to a rb tree.
No functionality change here.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
choose_service_tree() selects/sets both wl_class and wl_type. Rename it to
choose_wl_class_and_type() to make it very clear.
cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
it with choose_st(). So rename it to cfq_choose_wl_type() to make
it clear what does it do.
Just renaming. No functionality change.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At quite a few places we use the keyword "service_tree". At some places,
especially local variables, I have abbreviated it to "st".
Also at couple of places moved binary operator "+" from beginning of line
to end of previous line, as per Tejun's feedback.
v2:
Reverted most of the service tree name change based on Jeff Moyer's feedback.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Some more renaming. Again making the code uniform w.r.t use of
wl_class/class to represent IO class (RT, BE, IDLE) and using
wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC).
At places this patch shortens the string "workload" to "wl".
Renamed "saved_workload" to "saved_wl_type". Renamed
"saved_serving_class" to "saved_wl_class".
For uniformity with "saved_wl_*" variables, renamed "serving_class"
to "serving_wl_class" and renamed "serving_type" to "serving_wl_type".
Again, just trying to improve upon code uniformity and improve
readability. No functional change.
v2:
- Restored the usage of keyword "service" based on Jeff Moyer's feedback.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently CFQ has three IO classes, RT, BE and IDLE. At many a places we
are calling workloads belonging to these classes as "prio". This gets
very confusing as one starts to associate it with ioprio.
So this patch just does bunch of renaming so that reading code becomes
easier. All reference to RT, BE and IDLE workload are done using keyword
"class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made
using keyword "type".
This makes me feel much better while I am reading the code. There is no
functionality change due to this patch.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Remove a race condition which causes a warning in disk_clear_events. This
is a race between disk_clear_events() and disk_flush_events().
ev->clearing will be altered by disk_flush_events() even though we are
blocking event checking through disk_flush_events(). If this happens
after ev->clearing was cleared for disk_clear_events(), this can cause the
WARN_ON_ONCE() in that function to be triggered.
This change also has disk_clear_events() not go through a workqueue.
Since we have to wait for the work to complete, we should just call the
function directly. Also, since this work cannot be put on a freezable
workqueue, it will have to contend with increased demand, so calling the
function directly avoids this.
[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In disk_clear_events, do not put work on system_nrt_freezable_wq.
Instead, put it on system_nrt_wq.
There is a race between probing a usb and suspending the device. Since
probing a usb calls disk_clear_events, which puts work on a frozen
workqueue, probing cannot finish after the workqueue is frozen. However,
suspending cannot finish until the usb probe is finished, so we get a
deadlock, causing the system to reboot.
The way to reproduce this bug is to wake up from suspend with a usb
storage device plugged in, or plugging in a usb storage device right
before suspend. The window of time is on the order of time it takes to
probe the usb device. As long as the workqueues are frozen before the
call to add_disk within sd_probe_async finishes, there will be a deadlock
(which calls blkdev_get, sd_open, check_disk_change, then
disk_clear_events). This is not difficult to reproduce after figuring out
the timings.
[akpm@linux-foundation.org: fix up comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc patches from Andrew Morton:
"Incoming:
- lots of misc stuff
- backlight tree updates
- lib/ updates
- Oleg's percpu-rwsem changes
- checkpatch
- rtc
- aoe
- more checkpoint/restart support
I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc/<pid>/fdinfo/<fd> output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
Pull block layer core updates from Jens Axboe:
"Here are the core block IO bits for 3.8. The branch contains:
- The final version of the surprise device removal fixups from Bart.
- Don't hide EFI partitions under advanced partition types. It's
fairly wide spread these days. This is especially dangerous for
systems that have both msdos and efi partition tables, where you
want to keep them in sync.
- Cleanup of using -1 instead of the proper NUMA_NO_NODE
- Export control of bdi flusher thread CPU mask and default to using
the home node (if known) from Jeff.
- Export unplug tracepoint for MD.
- Core improvements from Shaohua. Reinstate the recursive merge, as
the original bug has been fixed. Add plugging for discard and also
fix a problem handling non pow-of-2 discard limits.
There's a trivial merge in block/blk-exec.c due to a fix that went
into 3.7-rc at a later point than -rc4 where this is based."
* 'for-3.8/core' of git://git.kernel.dk/linux-block:
block: export block_unplug tracepoint
block: add plug for blkdev_issue_discard
block: discard granularity might not be power of 2
deadline: Allow 0ms deadline latency, increase the read speed
partitions: enable EFI/GPT support by default
bsg: Remove unused function bsg_goose_queue()
block: Make blk_cleanup_queue() wait until request_fn finished
block: Avoid scheduling delayed work on a dead queue
block: Avoid that request_fn is invoked on a dead queue
block: Let blk_drain_queue() caller obtain the queue lock
block: Rename queue dead flag
bdi: add a user-tunable cpu_list for the bdi flusher threads
block: use NUMA_NO_NODE instead of -1
block: recursive merge requests
block CFQ: avoid moving request to different queue
Last post of this patch appears lost, so I resend this.
Now discard merge works, add plug for blkdev_issue_discard. This will help
discard request merge especially for raid0 case. In raid0, a big discard
request is split to small requests, and if correct plug is added, such small
requests can be merged in low layer.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In MD raid case, discard granularity might not be power of 2, for example, a
4-disk raid5 has 3*chunk_size discard granularity. Correct the calculation for
such cases.
Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup changes from Tejun Heo:
"A lot of activities on cgroup side. The big changes are focused on
making cgroup hierarchy handling saner.
- cgroup_rmdir() had peculiar semantics - it allowed cgroup
destruction to be vetoed by individual controllers and tried to
drain refcnt synchronously. The vetoing never worked properly and
caused good deal of contortions in cgroup. memcg was the last
reamining user. Michal Hocko removed the usage and cgroup_rmdir()
path has been simplified significantly. This was done in a
separate branch so that the memcg people can base further memcg
changes on top.
- The above allowed cleaning up cgroup lifecycle management and
implementation of generic cgroup iterators which are used to
improve hierarchy support.
- cgroup_freezer updated to allow migration in and out of a frozen
cgroup and handle hierarchy. If a cgroup is frozen, all descendant
cgroups are frozen.
- netcls_cgroup and netprio_cgroup updated to handle hierarchy
properly.
- Various fixes and cleanups.
- Two merge commits. One to pull in memcg and rmdir cleanups (needed
to build iterators). The other pulled in cgroup/for-3.7-fixes for
device_cgroup fixes so that further device_cgroup patches can be
stacked on top."
Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
touching code close to commit 2ef37d3fe4 ("memcg: Simplify
mem_cgroup_force_empty_list error handling") in for-3.8)
* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
cgroup: update Documentation/cgroups/00-INDEX
cgroup_rm_file: don't delete the uncreated files
cgroup: remove subsystem files when remounting cgroup
cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
cgroup: warn about broken hierarchies only after css_online
cgroup: list_del_init() on removed events
cgroup: fix lockdep warning for event_control
cgroup: move list add after list head initilization
netprio_cgroup: allow nesting and inherit config on cgroup creation
netprio_cgroup: implement netprio[_set]_prio() helpers
netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
netprio_cgroup: reimplement priomap expansion
netprio_cgroup: shorten variable names in extend_netdev_table()
netprio_cgroup: simplify write_priomap()
netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
cgroup: remove obsolete guarantee from cgroup_task_migrate.
cgroup: add cgroup->id
cgroup, cpuset: remove cgroup_subsys->post_clone()
cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
...
Change a timer compare from after to after-equals, thus allowing
0 timeout and making deadline schedule FIFO.
Signed-off-by: xiaobing tu <xiaobing.tu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The Kconfig currently enables MSDOS partitions by default because they
are assumed to be essential, but it's necessary to enable "advanced
partition selection" in order to get GPT support. IMO GPT partitions
are becoming common enought to deserve the same treatment MSDOS
partitions get.
(Side note: I got bit by a disk that had MSDOS and GPT partition
tables, but for some reason the MSDOS table was different from the
GPT one. I was stupid enought to disable "advanced partition
selection" in my .config, which disabled GPT partitioning and made
my btrfs pool unbootable because it couldn't find the partitions)
Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function bsg_goose_queue() does not have any in-tree callers,
so let's remove it.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>