Go to file
Shantanu Goel 333b0a459c mm, vmscan: fix zone balance check in prepare_kswapd_sleep
Patch series "Reduce amount of time kswapd sleeps prematurely", v2.

The series is unusual in that the first patch fixes one problem and
introduces other issues that are noted in the changelog.  Patch 2 makes
a minor modification that is worth considering on its own but leaves the
kernel in a state where it behaves badly.  It's not until patch 3 that
there is an improvement against baseline.

This was mostly motivated by examining Chris Mason's "simoop" benchmark
which puts the VM under similar pressure to HADOOP.  It has been
reported that the benchmark has regressed severely during the last
number of releases.  While I cannot reproduce all the same problems
Chris experienced due to hardware limitations, there was a number of
problems on a 2-socket machine with a single disk.

simoop latencies
                                         4.11.0-rc1            4.11.0-rc1
                                            vanilla          keepawake-v2
Amean    p50-Read             21670074.18 (  0.00%) 22668332.52 ( -4.61%)
Amean    p95-Read             25456267.64 (  0.00%) 26738688.00 ( -5.04%)
Amean    p99-Read             29369064.73 (  0.00%) 30991404.52 ( -5.52%)
Amean    p50-Write                1390.30 (  0.00%)      924.91 ( 33.47%)
Amean    p95-Write              412901.57 (  0.00%)     1362.62 ( 99.67%)
Amean    p99-Write             6668722.09 (  0.00%)    16854.04 ( 99.75%)
Amean    p50-Allocation          78714.31 (  0.00%)    74729.74 (  5.06%)
Amean    p95-Allocation         175533.51 (  0.00%)   101609.74 ( 42.11%)
Amean    p99-Allocation         247003.02 (  0.00%)   125765.57 ( 49.08%)

These are latencies.  Read/write are threads reading fixed-size random
blocks from a simulated database.  The allocation latency is mmaping and
faulting regions of memory.  The p50, 95 and p99 reports the worst
latencies for 50% of the samples, 95% and 99% respectively.

For example, the report indicates that while the test was running 99% of
writes completed 99.75% faster.  It's worth noting that on a UMA machine
that no difference in performance with simoop was observed so milage
will vary.

It's noted that there is a slight impact to read latencies but it's
mostly due to IO scheduler decisions and offset by the large reduction
in other latencies.

This patch (of 3):

The check in prepare_kswapd_sleep needs to match the one in
balance_pgdat since the latter will return as soon as any one of the
zones in the classzone is above the watermark.  This is specially
important for higher order allocations since balance_pgdat will
typically reset the order to zero relying on compaction to create the
higher order pages.  Without this patch, prepare_kswapd_sleep fails to
wake up kcompactd since the zone balance check fails.

It was first reported against 4.9.7 that kswapd is failing to wake up
kcompactd due to a mismatch in the zone balance check between
balance_pgdat() and prepare_kswapd_sleep().

balance_pgdat() returns as soon as a single zone satisfies the
allocation but prepare_kswapd_sleep() requires all zones to do +the
same.  This causes prepare_kswapd_sleep() to never succeed except in the
order == 0 case and consequently, wakeup_kcompactd() is never called.
For the machine that originally motivated this patch, the state of
compaction from /proc/vmstat looked this way after a day and a half +of
uptime:

compact_migrate_scanned 240496
compact_free_scanned 76238632
compact_isolated 123472
compact_stall 1791
compact_fail 29
compact_success 1762
compact_daemon_wake 0

After applying the patch and about 10 hours of uptime the state looks
like this:

compact_migrate_scanned 59927299
compact_free_scanned 2021075136
compact_isolated 640926
compact_stall 4
compact_fail 2
compact_success 2
compact_daemon_wake 5160

Further notes from Mel that motivated him to pick this patch up and
resend it;

It was observed for the simoop workload (pressures the VM similar to
HADOOP) that kswapd was failing to keep ahead of direct reclaim.  The
investigation noted that there was a need to rationalise kswapd
decisions to reclaim with kswapd decisions to sleep.  With this patch on
a 2-socket box, there was a 49% reduction in direct reclaim scanning.

However, the impact otherwise is extremely negative.  Kswapd reclaim
efficiency dropped from 98% to 76%.  simoop has three latency-related
metrics for read, write and allocation (an anonymous mmap and fault).

                                         4.11.0-rc1            4.11.0-rc1
                                            vanilla           fixcheck-v2
Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%)
Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%)
Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%)
Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)
Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)
Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)
Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)
Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)
Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%)

Of greater concern is that the patch causes swapping and page writes
from kswapd context rose from 0 pages to 4189753 pages during the hour
the workload ran for.  By and large, the patch has very bad behaviour
but easily missed as the impact on a UMA machine is negligible.

This patch is included with the data in case a bisection leads to this
area.  This patch is also a pre-requisite for the rest of the series.

Link: http://lkml.kernel.org/r/20170309075657.25121-2-mgorman@techsingularity.net
Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
arch blackfin: bf609: let clk_disable() return immediately if clk is NULL 2017-05-03 15:52:07 -07:00
block A reasonably busy cycle for documentation this time around. There is a new 2017-05-02 10:21:17 -07:00
certs KEYS: Use structure to capture key restriction function and data 2017-04-04 14:10:10 -07:00
crypto Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2017-05-03 08:50:52 -07:00
Documentation mm: memcontrol: provide shmem statistics 2017-05-03 15:52:08 -07:00
drivers Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2017-05-03 08:50:52 -07:00
firmware WHENCE: use https://linuxtv.org for LinuxTV URLs 2015-12-04 10:35:11 -02:00
fs jbd2: make the whole kjournald2 kthread NOFS safe 2017-05-03 15:52:09 -07:00
include mm: remove rodata_test_data export, add pr_fmt 2017-05-03 15:52:09 -07:00
init Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2017-05-02 19:09:35 -07:00
ipc kernel-api.rst: fix a series of errors when parsing C files 2017-04-02 14:31:49 -06:00
kernel mm: introduce memalloc_nofs_{save,restore} API 2017-05-03 15:52:09 -07:00
lib lockdep: allow to disable reclaim lockup detection 2017-05-03 15:52:09 -07:00
mm mm, vmscan: fix zone balance check in prepare_kswapd_sleep 2017-05-03 15:52:09 -07:00
net Merge branch 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit 2017-05-03 09:21:59 -07:00
samples Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching 2017-05-02 18:24:16 -07:00
scripts scripts/spelling.txt: add several more common spelling mistakes 2017-05-03 15:52:07 -07:00
security Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2017-05-03 08:50:52 -07:00
sound LED updates for 4.12 2017-05-01 13:12:49 -07:00
tools Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2017-05-02 19:09:35 -07:00
usr kbuild: initramfs cleanup, set target from Kconfig 2017-01-05 09:40:16 -08:00
virt KVM/ARM Fixes for v4.11-rc6 2017-04-05 16:27:47 +02:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild 2016-08-02 16:48:52 -04:00
.mailmap mailmap: add Martin Kepplinger's email 2017-04-13 18:24:21 -07:00
COPYING
CREDITS avr32: remove support for AVR32 architecture 2017-05-01 09:27:15 +02:00
Kbuild scripts/gdb: provide linux constants 2016-05-23 17:04:14 -07:00
Kconfig
MAINTAINERS Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2017-05-02 19:09:35 -07:00
Makefile A reasonably busy cycle for documentation this time around. There is a new 2017-05-02 10:21:17 -07:00
README README: add a new README file, pointing to the Documentation/ 2016-10-24 08:12:35 -02:00

Linux kernel
============

This file was moved to Documentation/admin-guide/README.rst

Please notice that there are several guides for kernel developers and users.
These guides can be rendered in a number of formats, like HTML and PDF.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
See Documentation/00-INDEX for a list of what is contained in each file.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.