9225 Commits

Author SHA1 Message Date
Konstantin Khlebnikov
761b06771a mm: completely remove dumping per-cpu lists from show_mem()
It seems nobody needs this.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Konstantin Khlebnikov
d1bfcdb8ce mm: hide per-cpu lists in output of show_mem()
This makes show_mem() much less verbose on huge machines.  Instead of huge
and almost useless dump of counters for each per-zone per-cpu lists this
patch prints the sum of these counters for each zone (free_pcp) and size
of per-cpu list for current cpu (local_pcp).

The filter flag SHOW_MEM_PERCPU_LISTS reverts to the old verbose mode.

[akpm@linux-foundation.org: update show_free_areas comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Konstantin Khlebnikov
b9ea25152e page_writeback: clean up mess around cancel_dirty_page()
This patch replaces cancel_dirty_page() with a helper function
account_page_cleaned() which only updates counters.  It's called from
truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
Page is locked in both cases, page-lock protects against concurrent
dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
ongoing truncation").

Delete_from_page_cache() shouldn't be called for dirty pages, they must
be handled by caller (either written or truncated).  This patch treats
final dirty accounting fixup at the end of __delete_from_page_cache() as
a debug check and adds WARN_ON_ONCE() around it.  If something removes
dirty pages without proper handling that might be a bug and unwritten
data might be lost.

Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
here.

cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is
helper for nfs_invalidate_page() and it's called only in case complete
invalidation.

The mess was started in v2.6.20 after commits 46d2277c796f ("Clean up
and make try_to_free_buffers() not race with dirty pages") and
3e67c0987d75 ("truncate: clear page dirtiness before running
try_to_free_buffers()") first was reverted right in v2.6.20 in commit
ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in
v2.6.25 commit a2b345642f53 ("Fix dirty page accounting leak with ext3
data=journal").

Custom fixes were introduced between these points.  NFS in v2.6.23, commit
1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()").
Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do
dirty page accounting when removing a page from the page cache").  Since
v2.6.25 all of them are redundant.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Ebru Akagunduz
ca0984caa8 mm: incorporate zero pages into transparent huge pages
This patch improves THP collapse rates, by allowing zero pages.

Currently THP can collapse 4kB pages into a THP when there are up to
khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch counts
pte none and mapped zero pages with the same variable.

The patch was tested with a program that allocates 800MB of
memory, and performs interleaved reads and writes, in a pattern
that causes some 2MB areas to first see read accesses, resulting
in the zero pfn being mapped there.

To simulate memory fragmentation at allocation time, I modified
do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read faults.

Without the patch, only %50 of the program was collapsed into THP and the
percentage did not increase over time.

With this patch after 10 minutes of waiting khugepaged had collapsed %99
of the program's memory.

[aarcange@redhat.com: fix bogus BUG()]
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Joonsoo Kim
2149cdaef6 mm/compaction: enhance compaction finish condition
Compaction has anti fragmentation algorithm.  It is that freepage should
be more than pageblock order to finish the compaction if we don't find any
freepage in requested migratetype buddy list.  This is for mitigating
fragmentation, but, there is a lack of migratetype consideration and it is
too excessive compared to page allocator's anti fragmentation algorithm.

Not considering migratetype would cause premature finish of compaction.
For example, if allocation request is for unmovable migratetype, freepage
with CMA migratetype doesn't help that allocation and compaction should
not be stopped.  But, current logic regards this situation as compaction
is no longer needed, so finish the compaction.

Secondly, condition is too excessive compared to page allocator's logic.
We can steal freepage from other migratetype and change pageblock
migratetype on more relaxed conditions in page allocator.  This is
designed to prevent fragmentation and we can use it here.  Imposing hard
constraint only to the compaction doesn't help much in this case since
page allocator would cause fragmentation again.

To solve these problems, this patch borrows anti fragmentation logic from
page allocator.  It will reduce premature compaction finish in some cases
and reduce excessive compaction work.

stress-highalloc test in mmtests with non movable order 7 allocation shows
considerable increase of compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
31.82 : 42.20

I tested it on non-reboot 5 runs stress-highalloc benchmark and found that
there is no more degradation on allocation success rate than before.  That
roughly means that this patch doesn't result in more fragmentations.

Vlastimil suggests additional idea that we only test for fallbacks when
migration scanner has scanned a whole pageblock.  It looked good for
fragmentation because chance of stealing increase due to making more free
pages in certain pageblock.  So, I tested it, but, it results in decreased
compaction success rate, roughly 38.00.  I guess the reason that if system
is low memory condition, watermark check could be failed due to not enough
order 0 free page and so, sometimes, we can't reach a fallback check
although migrate_pfn is aligned to pageblock_nr_pages.  I can insert code
to cope with this situation but it makes code more complicated so I don't
include his idea at this patch.

[akpm@linux-foundation.org: fix CONFIG_CMA=n build]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Joonsoo Kim
4eb7dce620 mm/page_alloc: factor out fallback freepage checking
This is preparation step to use page allocator's anti fragmentation logic
in compaction.  This patch just separates fallback freepage checking part
from fallback freepage management part.  Therefore, there is no functional
change.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Joonsoo Kim
dc67647b78 mm/cma: change fallback behaviour for CMA freepage
Freepage with MIGRATE_CMA can be used only for MIGRATE_MOVABLE and they
should not be expanded to other migratetype buddy list to protect them
from unmovable/reclaimable allocation.  Implementing these requirements in
__rmqueue_fallback(), that is, finding largest possible block of freepage
has bad effect that high order freepage with MIGRATE_CMA are broken
continually although there are suitable order CMA freepage.  Reason is
that they are not be expanded to other migratetype buddy list and next
__rmqueue_fallback() invocation try to finds another largest block of
freepage and break it again.  So, MIGRATE_CMA fallback should be handled
separately.  This patch introduces __rmqueue_cma_fallback(), that just
wrapper of __rmqueue_smallest() and call it before __rmqueue_fallback() if
migratetype == MIGRATE_MOVABLE.

This results in unintended behaviour change that MIGRATE_CMA freepage is
always used first rather than other migratetype as movable allocation's
fallback.  But, as already mentioned above, MIGRATE_CMA can be used only
for MIGRATE_MOVABLE, so it is better to use MIGRATE_CMA freepage first as
much as possible.  Otherwise, we needlessly take up precious freepages
with other migratetype and increase chance of fragmentation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
David Rientjes
30467e0b3b mm, hotplug: fix concurrent memory hot-add deadlock
There's a deadlock when concurrently hot-adding memory through the probe
interface and switching a memory block from offline to online.

When hot-adding memory via the probe interface, add_memory() first takes
mem_hotplug_begin() and then device_lock() is later taken when registering
the newly initialized memory block.  This creates a lock dependency of (1)
mem_hotplug.lock (2) dev->mutex.

When switching a memory block from offline to online, dev->mutex is first
grabbed in device_online() when the write(2) transitions an existing
memory block from offline to online, and then online_pages() will take
mem_hotplug_begin().

This creates a lock inversion between mem_hotplug.lock and dev->mutex.
Vitaly reports that this deadlock can happen when kworker handling a probe
event races with systemd-udevd switching a memory block's state.

This patch requires the state transition to take mem_hotplug_begin()
before dev->mutex.  Hot-adding memory via the probe interface creates a
memory block while holding mem_hotplug_begin(), there is no way to take
dev->mutex first in this case.

online_pages() and offline_pages() are only called when transitioning
memory block state.  We now require that mem_hotplug_begin() is taken
before calling them -- this requires exporting the mem_hotplug_begin() and
mem_hotplug_done() to generic code.  In all hot-add and hot-remove cases,
mem_hotplug_begin() is done prior to device_online().  This is all that is
needed to avoid the deadlock.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Andrew Morton
875abdb6d4 mm-cma-allocation-trigger-fix
s/CONFIG_CMA_ALIGNMENT/0/, per Joonsoo

Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sasha Levin
8325330b02 mm: cma: release trigger
Provides a userspace interface to trigger a CMA release.

Usage:

        echo [pages] > free

This would provide testing/fuzzing access to the CMA release paths.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.cz: fix build]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sasha Levin
26b02a1f96 mm: cma: allocation trigger
Provides a userspace interface to trigger a CMA allocation.

Usage:

        echo [pages] > alloc

This would provide testing/fuzzing access to the CMA allocation paths.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sasha Levin
28b24c1fc8 mm: cma: debugfs interface
I've noticed that there is no interfaces exposed by CMA which would let me
fuzz what's going on in there.

This small patchset exposes some information out to userspace, plus adds
the ability to trigger allocation and freeing from userspace.

This patch (of 3):

Implement a simple debugfs interface to expose information about CMA areas
in the system.

Useful for testing/sanity checks for CMA since it was impossible to
previously retrieve this information in userspace.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sheng Yong
19c07d5e04 memory hotplug: use macro to switch between section and pfn
Use macro section_nr_to_pfn() to switch between section and pfn, instead
of open-coding it.  No semantic changes.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Johannes Weiner
1575e68b3c mm: memcontrol: update copyright notice
Add myself to the list of copyright holders.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Baoquan He
7fc825b456 mm/memblock.c: rename local variable of memblock_type to `type'
A small cleanup.  Seems in e3239ff9 ("memblock: Rename memblock_region to
memblock_type and memblock_property to memblock_region") this one was
missed.

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
acc3c8d15e mm: move mm_populate()-related code to mm/gup.c
It's odd that we have populate_vma_page_range() and __mm_populate() in
mm/mlock.c.  It's implementation of generic memory population and mlocking
is one of possible side effect, if VM_LOCKED is set.

__get_user_pages() is core of the implementation.  Let's move the code
into mm/gup.c.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
c561259ca7 mm: move gup() -> posix mlock() error conversion out of __mm_populate
This is praparation to moving mm_populate()-related code out of
mm/mlock.c.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
fc05f56621 mm: rename __mlock_vma_pages_range() to populate_vma_page_range()
__mlock_vma_pages_range() doesn't necessarily mlock pages.  It depends on
vma flags.  The same codepath is used for MAP_POPULATE.

Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

This patch also drops mlock_vma_pages_range() references from
documentation.  It has gone in cea10a19b797 ("mm: directly use
__mlock_vma_pages_range() in find_extend_vma()").

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
84d33df279 mm: rename FOLL_MLOCK to FOLL_POPULATE
After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
get_user_pages only for mlock") FOLL_MLOCK has lost its original
meaning: we don't necessarily mlock the page if the flags is set -- we
also take VM_LOCKED into consideration.

Since we use the same codepath for __mm_populate(), let's rename
FOLL_MLOCK to FOLL_POPULATE.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Fabian Frederick
c21a6daf46 slob: make slob_alloc_node() static and remove EXPORT_SYMBOL()
slob_alloc_node() is only used in slob.c.  Remove the EXPORT_SYMBOL and
make slob_alloc_node() static.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Joe Perches
6f6528a163 slub: use bool function return values of true/false not 1/0
Use the normal return values for bool functions

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Chris J Arges
08303a73c6 mm/slub.c: parse slub_debug O option in switch statement
By moving the O option detection into the switch statement, we allow this
parameter to be combined with other options correctly.  Previously options
like slub_debug=OFZ would only detect the 'o' and use DEBUG_DEFAULT_FLAGS
to fill in the rest of the flags.

Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Geert Uytterhoeven
ef2a5153b4 mm/migrate: mark unmap_and_move() "noinline" to avoid ICE in gcc 4.7.3
With gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1) :

    mm/migrate.c: In function `migrate_pages':
    mm/migrate.c:1148:1: internal compiler error: in push_minipool_fix, at config/arm/arm.c:13500
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See <file:///usr/share/doc/gcc-4.7/README.Bugs> for instructions.
    Preprocessed source stored into /tmp/ccPoM1tr.out file, please attach this to your bugreport.
    make[1]: *** [mm/migrate.o] Error 1
    make: *** [mm/migrate.o] Error 2

Mark unmap_and_move() (which is used in a single place only) "noinline"
to work around this compiler bug.

[akpm@linux-foundation.org: make it conditional on gcc-4.7.3 and arm]
[khilman@kernel.org: fine-tune compiler versions]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reported-by: Kevin Hilman <khilman@kernel.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Gerald Schaefer
9753412701 mm/hugetlb: use pmd_page() in follow_huge_pmd()
Commit 61f77eda9bbf ("mm/hugetlb: reduce arch dependent code around
follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte
layout differ and using pte_page() on a huge pmd will return wrong
results.  Using pmd_page() instead fixes this.

All architectures that were touched by that commit have pmd_page()
defined, so this should not break anything on other architectures.

Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*"
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>, Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:56 -07:00
Linus Torvalds
ca2ec32658 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:
 "Part one:

   - struct filename-related cleanups

   - saner iov_iter_init() replacements (and switching the syscalls to
     use of those)

   - ntfs switch to ->write_iter() (Anton)

   - aio cleanups and splitting iocb into common and async parts
     (Christoph)

   - assorted fixes (me, bfields, Andrew Elble)

  There's a lot more, including the completion of switchover to
  ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
  race fixes, etc, but that goes after #for-davem merge.  David has
  pulled it, and once it's in I'll send the next vfs pull request"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
  sg_start_req(): use import_iovec()
  sg_start_req(): make sure that there's not too many elements in iovec
  blk_rq_map_user(): use import_single_range()
  sg_io(): use import_iovec()
  process_vm_access: switch to {compat_,}import_iovec()
  switch keyctl_instantiate_key_common() to iov_iter
  switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
  aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
  vmsplice_to_user(): switch to import_iovec()
  kill aio_setup_single_vector()
  aio: simplify arguments of aio_setup_..._rw()
  aio: lift iov_iter_init() into aio_setup_..._rw()
  lift iov_iter into {compat_,}do_readv_writev()
  NFS: fix BUG() crash in notify_change() with patch to chown_common()
  dcache: return -ESTALE not -EBUSY on distributed fs race
  NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
  VFS: Add iov_iter_fault_in_multipages_readable()
  drop bogus check in file_open_root()
  switch security_inode_getattr() to struct path *
  constify tomoyo_realpath_from_path()
  ...
2015-04-14 15:31:03 -07:00
Linus Torvalds
45141eeafe Merge branch 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Workqueue now prints debug information at the end of sysrq-t which
  should be helpful when tracking down suspected workqueue stalls.  It
  only prints out the ones with something currently going on so it
  shouldn't add much output in most cases"

* 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Reorder sysfs code
  percpu: Fix trivial typos in comments
  workqueue: dump workqueues on sysrq-t
  workqueue: keep track of the flushing task and pool manager
  workqueue: make the workqueues list RCU walkable
2015-04-13 16:19:18 -07:00
Al Viro
2ba48ce513 mirror O_APPEND and O_DIRECT into iocb->ki_flags
... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:22 -04:00
Al Viro
3309dd04cb switch generic_write_checks() to iocb and iter
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success.  Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
	fs/ext4/file.c
2015-04-11 22:30:21 -04:00
Al Viro
dfea934575 Merge branch 'for-linus' into for-next 2015-04-11 22:29:51 -04:00
Al Viro
0fa6b005af generic_write_checks(): drop isblk argument
all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:48 -04:00
Al Viro
5f380c7fa7 lift generic_write_checks() into callers of __generic_file_write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:47 -04:00
Al Viro
0b8def9d6d __generic_file_write_iter: keep ->ki_pos and return value consistent
A side effect worth noting: in O_APPEND case we set ->ki_pos early,
so if it turns out to be an error or a zero-length write, we'll
end up with ->ki_pos modified.  Safe, since all callers never
look at the ->ki_pos after the call of __generic_file_write_iter()
returning non-positive, all the way to caller of ->write_iter() and
those discard ->ki_pos when getting that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:47 -04:00
Omar Sandoval
22c6186ece direct_IO: remove rw from a_ops->direct_IO()
Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Al Viro
5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro
6e242a1cee nommu: use __vfs_read()
... instead of open-coding the call of ->read()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:27:56 -04:00
Al Viro
39c853ebfe Merge branch 'for-davem' into for-next 2015-04-11 22:27:19 -04:00
Al Viro
17d17e7282 process_vm_access: switch to {compat_,}import_iovec()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:27:12 -04:00
Al Viro
c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Al Viro
237dae8890 Merge branch 'iocb' into for-davem
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:01:38 -04:00
Al Viro
b2edffdd91 fix mremap() vs. ioctx_kill() race
teach ->mremap() method to return an error and have it fail for
aio mappings in process of being killed

Note that in case of ->mremap() failure we need to undo move_page_tables()
we'd already done; we could call ->mremap() first, but then the failure of
move_page_tables() would require undoing whatever _successful_ ->mremap()
has done, which would be a lot more headache in general.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-06 17:50:59 -04:00
Christoph Hellwig
e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Linus Torvalds
9c8e30d12d Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: numa: mark huge PTEs young when clearing NUMA hinting faults
  mm: numa: slow PTE scan rate if migration failures occur
  mm: numa: preserve PTE write permissions across a NUMA hinting fault
  mm: numa: group related processes based on VMA flags instead of page table flags
  hfsplus: fix B-tree corruption after insertion at position 0
  MAINTAINERS: add Jan as DMI/SMBIOS support maintainer
  fs/affs/file.c: unlock/release page on error
  mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate
  mm/slub: fix lockups on PREEMPT && !SMP kernels
  mm/memory hotplug: postpone the reset of obsolete pgdat
  MAINTAINERS: correct rtc armada38x pattern entry
  mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers
  mm: fix anon_vma->degree underflow in anon_vma endless growing prevention
  drivers/rtc/rtc-mrst: fix suspend/resume
  aoe: update aoe maintainer information
2015-03-25 16:21:17 -07:00
Mel Gorman
b7b04004ec mm: numa: mark huge PTEs young when clearing NUMA hinting faults
Base PTEs are marked young when the NUMA hinting information is cleared
but the same does not happen for huge pages which this patch addresses.

Note that migrated pages are not marked young as the base page migration
code does not assume that migrated pages have been referenced.  This
could be addressed but beyond the scope of this series which is aimed at
Dave Chinners shrink workload that is unlikely to be affected by this
issue.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Mel Gorman
074c238177 mm: numa: slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Mel Gorman
b191f9b106 mm: numa: preserve PTE write permissions across a NUMA hinting fault
Protecting a PTE to trap a NUMA hinting fault clears the writable bit
and further faults are needed after trapping a NUMA hinting fault to set
the writable bit again.  This patch preserves the writable bit when
trapping NUMA hinting faults.  The impact is obvious from the number of
minor faults trapped during the basis balancing benchmark and the system
CPU usage;

  autonumabench
                                             4.0.0-rc4             4.0.0-rc4
                                              baseline              preserve
  Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
  Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
  Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
  Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
  Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
  Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
  Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
  Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)

               4.0.0-rc4   4.0.0-rc4
                baseline    preserve
  User          44383.95    43971.89
  System          252.61      201.24
  Elapsed         998.68     1000.94

  Minor Faults   2597249     1981230
  Major Faults       365         364

There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
workload

                                      4.0.0-rc4             4.0.0-rc4
                                       baseline              preserve
  Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
  Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)

The patch looks hacky but the alternatives looked worse.  The tidest was
to rewalk the page tables after a hinting fault but it was more complex
than this approach and the performance was worse.  It's not generally
safe to just mark the page writable during the fault if it's a write
fault as it may have been read-only for COW so that approach was
discarded.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Mel Gorman
bea66fbd11 mm: numa: group related processes based on VMA flags instead of page table flags
These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
Return the correct value for change_huge_pmd").  It was known that the
performance in 3.19 was still better even if is far less safe.  This
series aims to restore the performance without compromising on safety.

For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
three patches applied on top

  autonumabench
                                                3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                               vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
  Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
  Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
  Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
  Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
  Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
  Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
  Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)

                3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
               vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  User        46334.60    46391.94    44383.95    43971.89    44372.12
  System        252.84      284.66      252.61      201.24      249.00
  Elapsed      1062.14     1050.96      998.68     1000.94     1026.78

Overall the system CPU usage is comparable and the test is naturally a
bit variable.  The slowing of the scanner hurts numa01 but on this
machine it is an adverse workload and patches that dramatically help it
often hurt absolutely everything else.

Due to patch 2, the fault activity is interesting

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                   2097811     2656646     2597249     1981230     1636841
  Major Faults                       362         450         365         364         365

Note the impact preserving the write bit across protection updates and
fault reduces faults.

  NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
  NUMA alloc miss                      0           0           0           0           0
  NUMA interleave hit                  0           0           0           0           0
  NUMA alloc local               1228514     1216317     1190871     1177448     1199021
  NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
  NUMA huge PMD updates           479530      468448      464868      477573      224487
  NUMA page range updates      491225557   479886983   476207932   489222218   229950144
  NUMA hint faults                659753      656503      641678      656926      294842
  NUMA hint local faults          381604      373963      360478      337585      186249
  NUMA hint local percent             57          56          56          51          63
  NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
  AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%

Here the impact of slowing the PTE scanner on migratrion failures is
obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
massively reduced even though the headline performance is very similar.

As xfsrepair was the reported workload here is the impact of the series
on it.

  xfsrepair
                                         3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                        vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
  Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
  Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
  Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
  Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
  Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
  Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
  Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
  Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
  Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
  Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
  Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
  CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
  CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
  CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
  CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
  Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
  Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
  Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
  Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)

The really relevant lines as syst-xfsrepair which is the system CPU
usage when running xfsrepair.  Note that on my machine the overhead was
45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
preserve the write bit across faults, it's only 2.51% higher on average.
With the full series applied, system CPU usage is 24.6% lower on
average.

Again, the impact of preserving the write bit on minor faults is obvious
and the impact of slowing scanning after migration failures is obvious
on the PTE updates.  Note also that the number of pages migrated is much
reduced even though the headline performance is comparable.

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                 153466827   254507978   249163829   153501373   105737890
  Major Faults                       610         702         690         649         724
  NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
  NUMA huge PMD updates           129294       85044      106921      127246       79887
  NUMA pages migrated           21938995    29705270    28594162    22687324    16258075

                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
  Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
  Mean sdb-await         25.92        5.09        5.33        5.02        5.22
  Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
  Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
  Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
  Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
  Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
  Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
  Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
  Max  sdb-await        168.24       39.28       49.22       44.64       65.62
  Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
  Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
  Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
  Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
  Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15

FWIW, I also checked SPECjbb in different configurations but it's
similar observations -- minor faults lower, PTE update activity lower
and performance is roughly comparable against 3.19.

This patch (of 3):

Threads that share writable data within pages are grouped together as
related tasks.  This decision is based on whether the PTE is marked
dirty which is subject to timing races between the PTE scanner update
and when the application writes the page.  If the page is file-backed,
then background flushes and sync also affect placement.  This is
unpredictable behaviour which is impossible to reason about so this
patch makes grouping decisions based on the VMA flags.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Laura Abbott
cfa8694382 mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate
Commit 3c605096d315 ("mm/page_alloc: restrict max order of merging on
isolated pageblock") changed the logic of unset_migratetype_isolate to
check the buddy allocator and explicitly call __free_pages to merge.

The page that is being freed in this path never had prep_new_page called
so set_page_refcounted is called explicitly but there is no call to
kernel_map_pages.  With the default kernel_map_pages this is mostly
harmless but if kernel_map_pages does any manipulation of the page
tables (unmapping or setting pages to read only) this may trigger a
fault:

    alloc_contig_range test_pages_isolated(ceb00, ced00) failed
    Unable to handle kernel paging request at virtual address ffffffc0cec00000
    pgd = ffffffc045fc4000
    [ffffffc0cec00000] *pgd=0000000000000000
    Internal error: Oops: 9600004f [#1] PREEMPT SMP
    Modules linked in: exfatfs
    CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
    task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
    PC is at memset+0xc8/0x1c0
    LR is at kernel_map_pages+0x1ec/0x244

Fix this by calling kernel_map_pages to ensure the page is set in the
page table properly

Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Mark Rutland
859b7a0e89 mm/slub: fix lockups on PREEMPT && !SMP kernels
Commit 9aabf810a67c ("mm/slub: optimize alloc/free fastpath by removing
preemption on/off") introduced an occasional hang for kernels built with
CONFIG_PREEMPT && !CONFIG_SMP.

The problem is the following loop the patch introduced to
slab_alloc_node and slab_free:

    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

GCC 4.9 has been observed to hoist the load of c and c->tid above the
loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
constant and does not force a reload).  On arm64 the generated assembly
looks like:

         ldr     x4, [x0,#8]
  loop:
         ldr     x1, [x0,#8]
         cmp     x1, x4
         b.ne    loop

If the thread is preempted between the load of c->tid (into x1) and tid
(into x4), and an allocation or free occurs in another thread (bumping
the cpu_slab's tid), the thread will be stuck in the loop until
s->cpu_slab->tid wraps, which may be forever in the absence of
allocations/frees on the same CPU.

This patch changes the loop condition to access c->tid with READ_ONCE.
This ensures that the value is reloaded even when the compiler would
otherwise assume it could cache the value, and also ensures that the
load will not be torn.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Gu Zheng
b0dc3a342a mm/memory hotplug: postpone the reset of obsolete pgdat
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:

  BUG: unable to handle kernel paging request at 0000000000025f60
  IP: next_online_pgdat+0x1/0x50
  PGD 0
  Oops: 0000 [#1] SMP
  ACPI: Device does not support D3cold
  Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
  CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
  Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
  Workqueue: events vmstat_update
  task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
  RIP: 0010: next_online_pgdat+0x1/0x50
  RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
  RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
  RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
  RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
  R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
  R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
  FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    refresh_cpu_vm_stats+0xd0/0x140
    vmstat_update+0x11/0x50
    process_one_work+0x194/0x3d0
    worker_thread+0x12b/0x410
    kthread+0xc6/0xd0
    ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.

process A:				offline node XX:

vmstat_updat()
   refresh_cpu_vm_stats()
     for_each_populated_zone()
       find online node XX
     cond_resched()
					offline cpu and memory, then try_offline_node()
					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
       zone = next_zone(zone)
         pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
           next_online_pgdat(pgdat)
             next_online_node(pgdat->node_id);  // NULL pointer access

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Naoya Horiguchi
f683739539 mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers
walk_page_test() is purely pagewalk's internal stuff, and its positive
return values are not intended to be passed to the callers of pagewalk.

However, in the current code if the last vma in the do-while loop in
walk_page_range() happens to return a positive value, it leaks outside
walk_page_range().  So the user visible effect is invalid/unexpected
return value (according to the reporter, mbind() causes it.)

This patch fixes it simply by reinitializing the return value after
checked.

Another exposed interface, walk_page_vma(), already returns 0 for such
cases so no problem.

Fixes: fafaa4264eba ("pagewalk: improve vma handling")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
Reported-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00