11021 Commits

Author SHA1 Message Date
Arnd Bergmann
23f919d4ad shmem: avoid maybe-uninitialized warning
After enabling -Wmaybe-uninitialized warnings, we get a false-postive
warning for shmem:

  mm/shmem.c: In function `shmem_getpage_gfp':
  include/linux/spinlock.h:332:21: error: `info' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This can be easily avoided, since the correct 'info' pointer is known at
the time we first enter the function, so we can simply move the
initialization up.  Moving it before the first label avoids the warning
and lets us remove two later initializations.

Note that the function is so hard to read that it not only confuses the
compiler, but also most readers and without this patch it could\ easily
break if one of the 'goto's changed.

Link: https://www.spinics.net/lists/kernel/msg2368133.html
Link: http://lkml.kernel.org/r/20161024205725.786455-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Ming Ling
6afcf8ef0c mm, compaction: fix NR_ISOLATED_* stats for pfn based migration
Since commit bda807d44454 ("mm: migrate: support non-lru movable page
migration") isolate_migratepages_block) can isolate !PageLRU pages which
would acct_isolated account as NR_ISOLATED_*.  Accounting these non-lru
pages NR_ISOLATED_{ANON,FILE} doesn't make any sense and it can misguide
heuristics based on those counters such as pgdat_reclaimable_pages resp.
too_many_isolated which would lead to unexpected stalls during the
direct reclaim without any good reason.  Note that
__alloc_contig_migrate_range can isolate a lot of pages at once.

On mobile devices such as 512M ram android Phone, it may use a big zram
swap.  In some cases zram(zsmalloc) uses too many non-lru but
migratedable pages, such as:

      MemTotal: 468148 kB
      Normal free:5620kB
      Free swap:4736kB
      Total swap:409596kB
      ZRAM: 164616kB(zsmalloc non-lru pages)
      active_anon:60700kB
      inactive_anon:60744kB
      active_file:34420kB
      inactive_file:37532kB

Fix this by only accounting lru pages to NR_ISOLATED_* in
isolate_migratepages_block right after they were isolated and we still
know they were on LRU.  Drop acct_isolated because it is called after
the fact and we've lost that information.  Batching per-cpu counter
doesn't make much improvement anyway.  Also make sure that we uncharge
only LRU pages when putting them back on the LRU in
putback_movable_pages resp.  when unmap_and_move migrates the page.

[mhocko@suse.com: replace acct_isolated() with direct counting]
Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration")
Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
Signed-off-by: Ming Ling <ming.ling@spreadtrum.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Michal Hocko
6d8409580b mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist
__GFP_THISNODE is documented to enforce the allocation to be satisified
from the requested node with no fallbacks or placement policy
enforcements.  policy_zonelist seemingly breaks this semantic if the
current policy is MPOL_MBIND and instead of taking the node it will
fallback to the first node in the mask if the requested one is not in
the mask.  This is confusing to say the least because it fact we
shouldn't ever go that path.  First tasks shouldn't be scheduled on CPUs
with nodes outside of their mempolicy binding.  And secondly
policy_zonelist is called only from 3 places:

 - huge_zonelist - never should do __GFP_THISNODE when going this path

 - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either

 - alloc_pages_current - which uses default_policy id __GFP_THISNODE is
   used

So we shouldn't even need to care about this possibility and can drop
the confusing code.  Let's keep a WARN_ON_ONCE in place to catch
potential users and fix them up properly (aka use a different allocation
function which ignores mempolicy).

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
David Rientjes
fd60775aea mm, thp: avoid unlikely branches for split_huge_pmd
While doing MADV_DONTNEED on a large area of thp memory, I noticed we
encountered many unlikely() branches in profiles for each backing
hugepage.  This is because zap_pmd_range() would call split_huge_pmd(),
which rechecked the conditions that were already validated, but as part
of an unlikely() branch.

Avoid the unlikely() branch when in a context where pmd is known to be
good for __split_huge_pmd() directly.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
zijun_hu
3f5000693f mm/vmalloc.c: simplify /proc/vmallocinfo implementation
Many seq_file helpers exist for simplifying implementation of virtual
files especially, for /proc nodes.  however, the helpers for iteration
over list_head are available but aren't adopted to implement
/proc/vmallocinfo currently.

Simplify /proc/vmallocinfo implementation by using existing seq_file
helpers.

Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Minchan Kim
29fac03bef mm: make unreserve highatomic functions reliable
Currently, unreserve_highatomic_pageblock bails out if it found
highatomic pageblock regardless of really moving free pages from the one
so that it could mitigate unreserve logic's goal which saves OOM of a
process.

This patch makes unreserve functions bail out only if it moves some
pages out of !highatomic free list to avoid such false positive.

Another potential problem is that by race between page freeing and
reserve highatomic function, pages could be in highatomic free list even
though the pageblock is !high atomic migratetype.  In that case,
unreserve_highatomic_pageblock can be void if count of highatomic
reserve is less than pageblock_nr_pages.  We could solve it simply via
draining all of reserved pages before the OOM.  It would have a
safeguard role to exhuast reserved pages before converging to OOM.

Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Minchan Kim
04c8716f7b mm: try to exhaust highatomic reserve before the OOM
I got OOM report from production team with v4.4 kernel.  It had enough
free memory but failed to allocate GFP_KERNEL order-0 page and finally
encountered OOM kill.  It occured during QA process which launches
several apps, switching and so on.  It happned rarely.  IOW, In normal
situation, it was not a problem but if we are unluck so that several
apps uses peak memory at the same time, it can happen.  If we manage to
pass the phase, the system can go working well.

I could reproduce it with my test(memory spike easily.  Look at below.

The reason is free pages(19M) of DMA32 zone are reserved for
HIGHORDERATOMIC and doesn't unreserved before the OOM.

  balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
  balloon cpuset=/ mems_allowed=0
  CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  Call Trace:
    dump_stack+0x63/0x90
    dump_header+0x5c/0x1ce
    oom_kill_process+0x22e/0x400
    out_of_memory+0x1ac/0x210
    __alloc_pages_nodemask+0x101e/0x1040
    handle_mm_fault+0xa0a/0xbf0
    __do_page_fault+0x1dd/0x4d0
    trace_do_page_fault+0x43/0x130
    do_async_page_fault+0x1a/0xa0
    async_page_fault+0x28/0x30
  Mem-Info:
  active_anon:383949 inactive_anon:106724 isolated_anon:0
   active_file:15 inactive_file:44 isolated_file:0
   unevictable:0 dirty:0 writeback:24 unstable:0
   slab_reclaimable:2483 slab_unreclaimable:3326
   mapped:0 shmem:0 pagetables:1906 bounce:0
   free:6898 free_pcp:291 free_cma:0
  Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
  DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 1952 1952 1952
  DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
  DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
  51131 total pagecache pages
  50795 pages in swap cache
  Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
  Free swap  = 8kB
  Total swap = 255996kB
  524158 pages RAM
  0 pages HighMem/MovableOnly
  12658 pages reserved
  0 pages cma reserved
  0 pages hwpoisoned

Another example exceeded the limit by the race is

  in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
  CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  Call Trace:
    dump_stack+0x63/0x90
    warn_alloc_failed+0xdb/0x130
    __alloc_pages_nodemask+0x4d6/0xdb0
    new_slab+0x339/0x490
    ___slab_alloc.constprop.74+0x367/0x480
    __slab_alloc.constprop.73+0x20/0x40
    __kmalloc+0x1a4/0x1e0
    alloc_indirect.isra.14+0x1d/0x50
    virtqueue_add_sgs+0x1c4/0x470
    __virtblk_add_req+0xae/0x1f0
    virtio_queue_rq+0x12d/0x290
    __blk_mq_run_hw_queue+0x239/0x370
    blk_mq_run_hw_queue+0x8f/0xb0
    blk_mq_insert_requests+0x18c/0x1a0
    blk_mq_flush_plug_list+0x125/0x140
    blk_flush_plug_list+0xc7/0x220
    blk_finish_plug+0x2c/0x40
    __do_page_cache_readahead+0x196/0x230
    filemap_fault+0x448/0x4f0
    ext4_filemap_fault+0x36/0x50
    __do_fault+0x75/0x140
    handle_mm_fault+0x84d/0xbe0
    __do_page_fault+0x1dd/0x4d0
    trace_do_page_fault+0x43/0x130
    do_async_page_fault+0x1a/0xa0
    async_page_fault+0x28/0x30
  Mem-Info:
  active_anon:363826 inactive_anon:121283 isolated_anon:32
   active_file:65 inactive_file:152 isolated_file:0
   unevictable:0 dirty:0 writeback:46 unstable:0
   slab_reclaimable:2778 slab_unreclaimable:3070
   mapped:112 shmem:0 pagetables:1822 bounce:0
   free:9469 free_pcp:231 free_cma:0
  Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
  DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 1952 1952 1952
  DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
  DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
  2775 total pagecache pages
  2536 pages in swap cache
  Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
  Free swap  = 108744kB
  Total swap = 255996kB
  524158 pages RAM
  0 pages HighMem/MovableOnly
  12648 pages reserved
  0 pages cma reserved
  0 pages hwpoisoned

It's weird to show that zone has enough free memory above min watermark
but OOMed with 4K GFP_KERNEL allocation due to reserved highatomic
pages.  As last resort, try to unreserve highatomic pages again and if
it has moved pages to non-highatmoc free list, retry reclaim once more.

Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Minchan Kim
4855e4a7f2 mm: prevent double decrease of nr_reserved_highatomic
There is race between page freeing and unreserved highatomic.

 CPU 0				    CPU 1

    free_hot_cold_page
      mt = get_pfnblock_migratetype
      set_pcppage_migratetype(page, mt)
    				    unreserve_highatomic_pageblock
    				    spin_lock_irqsave(&zone->lock)
    				    move_freepages_block
    				    set_pageblock_migratetype(page)
    				    spin_unlock_irqrestore(&zone->lock)
      free_pcppages_bulk
        __free_one_page(mt) <- mt is stale

By above race, a page on CPU 0 could go non-highorderatomic free list
since the pageblock's type is changed.  By that, unreserve logic of
highorderatomic can decrease reserved count on a same pageblock severak
times and then it will make mismatch between nr_reserved_highatomic and
the number of reserved pageblock.

So, this patch verifies whether the pageblock is highatomic or not and
decrease the count only if the pageblock is highatomic.

Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Minchan Kim
88ed365ea2 mm: don't steal highatomic pageblock
Patch series "use up highorder free pages before OOM", v3.

I got OOM report from production team with v4.4 kernel.  It had enough
free memory but failed to allocate GFP_KERNEL order-0 page and finally
encountered OOM kill.  It occured during QA process which launches
several apps, switching and so on.  It happned rarely.  IOW, In normal
situation, it was not a problem but if we are unluck so that several
apps uses peak memory at the same time, it can happen.  If we manage to
pass the phase, the system can go working well.

I could reproduce it with my test(memory spike easily.  Look at below.

The reason is free pages(19M) of DMA32 zone are reserved for
HIGHORDERATOMIC and doesn't unreserved before the OOM.

  balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
  balloon cpuset=/ mems_allowed=0
  CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  Call Trace:
    dump_stack+0x63/0x90
    dump_header+0x5c/0x1ce
    oom_kill_process+0x22e/0x400
    out_of_memory+0x1ac/0x210
    __alloc_pages_nodemask+0x101e/0x1040
    handle_mm_fault+0xa0a/0xbf0
    __do_page_fault+0x1dd/0x4d0
    trace_do_page_fault+0x43/0x130
    do_async_page_fault+0x1a/0xa0
    async_page_fault+0x28/0x30
  Mem-Info:
  active_anon:383949 inactive_anon:106724 isolated_anon:0
   active_file:15 inactive_file:44 isolated_file:0
   unevictable:0 dirty:0 writeback:24 unstable:0
   slab_reclaimable:2483 slab_unreclaimable:3326
   mapped:0 shmem:0 pagetables:1906 bounce:0
   free:6898 free_pcp:291 free_cma:0
  Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
  DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 1952 1952 1952
  DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
  DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
  51131 total pagecache pages
  50795 pages in swap cache
  Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
  Free swap  = 8kB
  Total swap = 255996kB
  524158 pages RAM
  0 pages HighMem/MovableOnly
  12658 pages reserved
  0 pages cma reserved
  0 pages hwpoisoned

Another example exceeded the limit by the race is

  in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
  CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  Call Trace:
    dump_stack+0x63/0x90
    warn_alloc_failed+0xdb/0x130
    __alloc_pages_nodemask+0x4d6/0xdb0
    new_slab+0x339/0x490
    ___slab_alloc.constprop.74+0x367/0x480
    __slab_alloc.constprop.73+0x20/0x40
    __kmalloc+0x1a4/0x1e0
    alloc_indirect.isra.14+0x1d/0x50
    virtqueue_add_sgs+0x1c4/0x470
    __virtblk_add_req+0xae/0x1f0
    virtio_queue_rq+0x12d/0x290
    __blk_mq_run_hw_queue+0x239/0x370
    blk_mq_run_hw_queue+0x8f/0xb0
    blk_mq_insert_requests+0x18c/0x1a0
    blk_mq_flush_plug_list+0x125/0x140
    blk_flush_plug_list+0xc7/0x220
    blk_finish_plug+0x2c/0x40
    __do_page_cache_readahead+0x196/0x230
    filemap_fault+0x448/0x4f0
    ext4_filemap_fault+0x36/0x50
    __do_fault+0x75/0x140
    handle_mm_fault+0x84d/0xbe0
    __do_page_fault+0x1dd/0x4d0
    trace_do_page_fault+0x43/0x130
    do_async_page_fault+0x1a/0xa0
    async_page_fault+0x28/0x30
  Mem-Info:
  active_anon:363826 inactive_anon:121283 isolated_anon:32
   active_file:65 inactive_file:152 isolated_file:0
   unevictable:0 dirty:0 writeback:46 unstable:0
   slab_reclaimable:2778 slab_unreclaimable:3070
   mapped:112 shmem:0 pagetables:1822 bounce:0
   free:9469 free_pcp:231 free_cma:0
  Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
  DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 1952 1952 1952
  DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
  DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
  2775 total pagecache pages
  2536 pages in swap cache
  Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
  Free swap  = 108744kB
  Total swap = 255996kB
  524158 pages RAM
  0 pages HighMem/MovableOnly
  12648 pages reserved
  0 pages cma reserved
  0 pages hwpoisoned

During the investigation, I found some problems with highatomic so this
patch aims to solve the problems and the final goal is to unreserve
every highatomic free pages before the OOM kill.

This patch (of 4):

In page freeing path, migratetype is racy so that a highorderatomic page
could free into non-highorderatomic free list.  If that page is
allocated, VM can change the pageblock from higorderatomic to something.
In that case, highatomic pageblock accounting is broken so it doesn't
work(e.g., VM cannot reserve highorderatomic pageblocks any more
although it doesn't reach 1% limit).

So, this patch prohibits the changing from highatomic to other type.
It's no problem because MIGRATE_HIGHATOMIC is not listed in fallback
array so stealing will only happen due to unexpected races which is
really rare.  Also, such prohibiting keeps highatomic pageblock more
longer so it would be better for highorderatomic page allocation.

Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Andreas Platschek
22901c6c9f kmemleak: fix reference to Documentation
Documentation/kmemleak.txt was moved to Documentation/dev-tools/kmemleak.rst,
this fixes the reference to the new location.

Link: http://lkml.kernel.org/r/1476544946-18804-1-git-send-email-andreas.platschek@opentech.at
Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Aneesh Kumar K.V
8bea805207 mm/hugetlb.c: use huge_pte_lock instead of opencoding the lock
No functional change by this patch.

Link: http://lkml.kernel.org/r/20161018090234.22574-1-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Aneesh Kumar K.V
3999f52e31 mm/hugetlb.c: use the right pte val for compare in hugetlb_cow
We cannot use the pte value used in set_pte_at for pte_same comparison,
because archs like ppc64, filter/add new pte flag in set_pte_at.
Instead fetch the pte value inside hugetlb_cow.  We are comparing pte
value to make sure the pte didn't change since we dropped the page table
lock.  hugetlb_cow get called with page table lock held, and we can take
a copy of the pte value before we drop the page table lock.

With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no
previous mapping (huge_pte_none entries), by forcing a cow in the fault
path.  This avoid take an addition fault to covert a read-only mapping
to read/write.  Here we were comparing a recently instantiated pte (via
set_pte_at) to the pte values from linux page table.  As explained above
on ppc64 such pte_same check returned wrong result, resulting in us
taking an additional fault on ppc64.

Fixes: 6a119eae942c ("powerpc/mm: Add a _PAGE_PTE bit")
Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Tobias Klauser
771ab4302c mm/gup.c: make unnecessarily global vma_permits_fault() static
Make vma_permits_fault() static as it is only used in mm/gup.c

This fixes a sparse warning.

Link: http://lkml.kernel.org/r/20161017122353.31598-1-tklauser@distanz.ch
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Shaohua Li
5f33a0803b mm/vmscan.c: set correct defer count for shrinker
Our system uses significantly more slab memory with memcg enabled with
the latest kernel.  With 3.10 kernel, slab uses 2G memory, while with
4.6 kernel, 6G memory is used.  The shrinker has problem.  Let's see we
have two memcg for one shrinker.  In do_shrink_slab:

1. Check cg1.  nr_deferred = 0, assume total_scan = 700.  batch size
   is 1024, then no memory is freed.  nr_deferred = 700

2. Check cg2.  nr_deferred = 700.  Assume freeable = 20, then
   total_scan = 10 or 40.  Let's assume it's 10.  No memory is freed.
   nr_deferred = 10.

The deferred share of cg1 is lost in this case.  kswapd will free no
memory even run above steps again and again.

The fix makes sure one memcg's deferred share isn't lost.

Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Andi Kleen
3e32158767 mm/mprotect.c: don't touch single threaded PTEs which are on the right node
We had some problems with pages getting unmapped in single threaded
affinitized processes.  It was tracked down to NUMA scanning.

In this case it doesn't make any sense to unmap pages if the process is
single threaded and the page is already on the node the process is
running on.

Add a check for this case into the numa protection code, and skip
unmapping if true.

In theory the process could be migrated later, but we will eventually
rescan and unmap and migrate then.

In theory this could be made more fancy: remembering this state per
process or even whole mm.  However that would need extra tracking and be
more complicated, and the simple check seems to work fine so far.

[ak@linux.intel.com: v3: Minor updates from Mel. Change code layout]
  Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org
Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
David Rientjes
bf00bd3458 mm, slab: maintain total slab count instead of active count
Rather than tracking the number of active slabs for each node, track the
total number of slabs.  This is a minor improvement that avoids active
slab tracking when a slab goes from free to partial or partial to free.

For slab debugging, this also removes an explicit free count since it
can easily be inferred by the difference in number of total objects and
number of active objects.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:07 -08:00
Greg Thelen
f728b0a5d7 mm, slab: faster active and free stats
Reading /proc/slabinfo or monitoring slabtop(1) can become very
expensive if there are many slab caches and if there are very lengthy
per-node partial and/or free lists.

Commit 07a63c41fa1f ("mm/slab: improve performance of gathering slabinfo
stats") addressed the per-node full lists which showed a significant
improvement when no objects were freed.  This patch has the same
motivation and optimizes the remainder of the usecases where there are
very lengthy partial and free lists.

This patch maintains per-node active_slabs (full and partial) and
free_slabs rather than iterating the lists at runtime when reading
/proc/slabinfo.

When allocating 100GB of slab from a test cache where every slab page is
on the partial list, reading /proc/slabinfo (includes all other slab
caches on the system) takes ~247ms on average with 48 samples.

As a result of this patch, the same read takes ~0.856ms on average.

[rientjes@google.com: changelog]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Thomas Garnier
e70954fd6d mm/slab_common.c: check kmem_create_cache flags are common
Verify that kmem_create_cache flags are not allocator specific.  It is
done before removing flags that are not available with the current
configuration.

The current kmem_cache_create removes incorrect flags but do not
validate the callers are using them right.  This change will ensure that
callers are not trying to create caches with flags that won't be used
because allocator specific.

Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Arnd Bergmann
84582c8ab9 slub: avoid false-postive warning
The slub allocator gives us some incorrect warnings when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro
prevents it from seeing that the return code matches what it was before:

  mm/slub.c: In function `kmem_cache_free_bulk':
  mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized]

I have not been able to come up with a perfect way for dealing with
this, the three options I see are:

 - add a bogus initialization, which would increase the runtime overhead
 - replace unlikely() with unlikely_notrace()
 - remove the unlikely() annotation completely

I checked the object code for a typical x86 configuration and the last
two cases produce the same result, so I went for the last one, which is
the simplest.

Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <labbott@fedoraproject.org>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Vladimir Davydov
89e364db71 slub: move synchronize_sched out of slab_mutex on shrink
synchronize_sched() is a heavy operation and calling it per each cache
owned by a memory cgroup being destroyed may take quite some time.  What
is worse, it's currently called under the slab_mutex, stalling all works
doing cache creation/destruction.

Actually, there isn't much point in calling synchronize_sched() for each
cache - it's enough to call it just once - after setting cpu_partial for
all caches and before shrinking them.  This way, we can also move it out
of the slab_mutex, which we have to hold for iterating over the slab
cache list.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Vladimir Davydov
13583c3d32 mm: memcontrol: use special workqueue for creating per-memcg caches
Creating a lot of cgroups at the same time might stall all worker
threads with kmem cache creation works, because kmem cache creation is
done with the slab_mutex held.  The problem was amplified by commits
801faf0db894 ("mm/slab: lockless decision to grow cache") in case of
SLAB and 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
increased the maximal time the slab_mutex can be held.

To prevent that from happening, let's use a special ordered single
threaded workqueue for kmem cache creation.  This shouldn't introduce
any functional changes regarding how kmem caches are created, as the
work function holds the global slab_mutex during its whole runtime
anyway, making it impossible to run more than one work at a time.  By
using a single threaded workqueue, we just avoid creating a thread per
each work.  Ordering is required to avoid a situation when a cgroup's
work is put off indefinitely because there are other cgroups to serve,
in other words to guarantee fairness.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Linus Torvalds
5645688f9d Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
 "The main changes in this development cycle were:

   - a large number of call stack dumping/printing improvements: higher
     robustness, better cross-context dumping, improved output, etc.
     (Josh Poimboeuf)

   - vDSO getcpu() performance improvement for future Intel CPUs with
     the RDPID instruction (Andy Lutomirski)

   - add two new Intel AVX512 features and the CPUID support
     infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
     He Chen)

   - more copy-user unification (Borislav Petkov)

   - entry code assembly macro simplifications (Alexander Kuleshov)

   - vDSO C/R support improvements (Dmitry Safonov)

   - misc fixes and cleanups (Borislav Petkov, Paul Bolle)"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  scripts/decode_stacktrace.sh: Fix address line detection on x86
  x86/boot/64: Use defines for page size
  x86/dumpstack: Make stack name tags more comprehensible
  selftests/x86: Add test_vdso to test getcpu()
  x86/vdso: Use RDPID in preference to LSL when available
  x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
  x86/cpufeatures: Enable new AVX512 cpu features
  x86/cpuid: Provide get_scattered_cpuid_leaf()
  x86/cpuid: Cleanup cpuid_regs definitions
  x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
  x86/unwind: Ensure stack grows down
  x86/vdso: Set vDSO pointer only after success
  x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
  x86/unwind: Detect bad stack return address
  x86/dumpstack: Warn on stack recursion
  x86/unwind: Warn on bad frame pointer
  x86/decoder: Use stderr if insn sanity test fails
  x86/decoder: Use stdout if insn decoder test is successful
  mm/page_alloc: Remove kernel address exposure in free_reserved_area()
  x86/dumpstack: Remove raw stack dump
  ...
2016-12-12 13:49:57 -08:00
Rafael J. Wysocki
631ddaba59 Merge branches 'pm-sleep' and 'powercap'
* pm-sleep:
  PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
  x86/suspend: fix false positive KASAN warning on suspend/resume
  PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag
  PM / sleep: System sleep state selection interface rework
  PM / hibernate: Verify the consistent of e820 memory map by md5 digest

* powercap:
  powercap / RAPL: Add Knights Mill CPUID
  powercap/intel_rapl: fix and tidy up error handling
  powercap/intel_rapl: Track active CPUs internally
  powercap/intel_rapl: Cleanup duplicated init code
  powercap/intel rapl: Convert to hotplug state machine
  powercap/intel_rapl: Propagate error code when registration fails
  powercap/intel_rapl: Add missing domain data update on hotplug
2016-12-12 20:46:35 +01:00
Linus Torvalds
0719dbf5e1 Merge branch 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull mm/PAT cleanup from Ingo Molnar:
 "A single cleanup for a generic interface that was originally
  introduced for PAT"

* 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pat, mm: Make track_pfn_insert() return void
2016-12-12 11:14:52 -08:00
Miklos Szeredi
dfeef68862 vfs: remove ".readlink = generic_readlink" assignments
If .readlink == NULL implies generic_readlink().

Generated by:

to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:04 +01:00
Linus Torvalds
10d20bd25e shmem: fix shm fallocate() list corruption
The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
want to race with generating new pages by faulting them in.

However, the wait-queue used to delay the page faulting has a serious
problem: the wait queue head (in shmem_fallocate()) is allocated on the
stack, and the code expects that "wake_up_all()" will make sure that all
the queue entries are gone before the stack frame is de-allocated.

And that is not at all necessarily the case.

Yes, a normal wake-up sequence will remove the wait-queue entry that
caused the wakeup (see "autoremove_wake_function()"), but the key
wording there is "that caused the wakeup".  When there are multiple
possible wakeup sources, the wait queue entry may well stay around.

And _particularly_ in a page fault path, we may be faulting in new pages
from user space while we also have other things going on, and there may
well be other pending wakeups.

So despite the "wake_up_all()", it's not at all guaranteed that all list
entries are removed from the wait queue head on the stack.

Fix this by introducing a new wakeup function that removes the list
entry unconditionally, even if the target process had already woken up
for other reasons.  Use that "synchronous" function to set up the
waiters in shmem_fault().

This problem has never been seen in the wild afaik, but Dave Jones has
reported it on and off while running trinity.  We thought we fixed the
stack corruption with the blk-mq rq_list locking fix (commit
7fe311302f7d: "blk-mq: update hardware and software queues for sleeping
alloc"), but it turns out there was _another_ stack corruptor hiding
in the trinity runs.

Vegard Nossum (also running trinity) was able to trigger this one fairly
consistently, and made us look once again at the shmem code due to the
faults often being in that area.

Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-06 08:59:05 -08:00
Josh Poimboeuf
b53f40db59 x86/suspend: fix false positive KASAN warning on suspend/resume
Resuming from a suspend operation is showing a KASAN false positive
warning:

  BUG: KASAN: stack-out-of-bounds in unwind_get_return_address+0x11d/0x130 at addr ffff8803867d7878
  Read of size 8 by task pm-suspend/7774
  page:ffffea000e19f5c0 count:0 mapcount:0 mapping:          (null) index:0x0
  flags: 0x2ffff0000000000()
  page dumped because: kasan: bad access detected
  CPU: 0 PID: 7774 Comm: pm-suspend Tainted: G    B           4.9.0-rc7+ #8
  Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F5 03/07/2016
  Call Trace:
    dump_stack+0x63/0x82
    kasan_report_error+0x4b4/0x4e0
    ? acpi_hw_read_port+0xd0/0x1ea
    ? kfree_const+0x22/0x30
    ? acpi_hw_validate_io_request+0x1a6/0x1a6
    __asan_report_load8_noabort+0x61/0x70
    ? unwind_get_return_address+0x11d/0x130
    unwind_get_return_address+0x11d/0x130
    ? unwind_next_frame+0x97/0xf0
    __save_stack_trace+0x92/0x100
    save_stack_trace+0x1b/0x20
    save_stack+0x46/0xd0
    ? save_stack_trace+0x1b/0x20
    ? save_stack+0x46/0xd0
    ? kasan_kmalloc+0xad/0xe0
    ? kasan_slab_alloc+0x12/0x20
    ? acpi_hw_read+0x2b6/0x3aa
    ? acpi_hw_validate_register+0x20b/0x20b
    ? acpi_hw_write_port+0x72/0xc7
    ? acpi_hw_write+0x11f/0x15f
    ? acpi_hw_read_multiple+0x19f/0x19f
    ? memcpy+0x45/0x50
    ? acpi_hw_write_port+0x72/0xc7
    ? acpi_hw_write+0x11f/0x15f
    ? acpi_hw_read_multiple+0x19f/0x19f
    ? kasan_unpoison_shadow+0x36/0x50
    kasan_kmalloc+0xad/0xe0
    kasan_slab_alloc+0x12/0x20
    kmem_cache_alloc_trace+0xbc/0x1e0
    ? acpi_get_sleep_type_data+0x9a/0x578
    acpi_get_sleep_type_data+0x9a/0x578
    acpi_hw_legacy_wake_prep+0x88/0x22c
    ? acpi_hw_legacy_sleep+0x3c7/0x3c7
    ? acpi_write_bit_register+0x28d/0x2d3
    ? acpi_read_bit_register+0x19b/0x19b
    acpi_hw_sleep_dispatch+0xb5/0xba
    acpi_leave_sleep_state_prep+0x17/0x19
    acpi_suspend_enter+0x154/0x1e0
    ? trace_suspend_resume+0xe8/0xe8
    suspend_devices_and_enter+0xb09/0xdb0
    ? printk+0xa8/0xd8
    ? arch_suspend_enable_irqs+0x20/0x20
    ? try_to_freeze_tasks+0x295/0x600
    pm_suspend+0x6c9/0x780
    ? finish_wait+0x1f0/0x1f0
    ? suspend_devices_and_enter+0xdb0/0xdb0
    state_store+0xa2/0x120
    ? kobj_attr_show+0x60/0x60
    kobj_attr_store+0x36/0x70
    sysfs_kf_write+0x131/0x200
    kernfs_fop_write+0x295/0x3f0
    __vfs_write+0xef/0x760
    ? handle_mm_fault+0x1346/0x35e0
    ? do_iter_readv_writev+0x660/0x660
    ? __pmd_alloc+0x310/0x310
    ? do_lock_file_wait+0x1e0/0x1e0
    ? apparmor_file_permission+0x18/0x20
    ? security_file_permission+0x73/0x1c0
    ? rw_verify_area+0xbd/0x2b0
    vfs_write+0x149/0x4a0
    SyS_write+0xd9/0x1c0
    ? SyS_read+0x1c0/0x1c0
    entry_SYSCALL_64_fastpath+0x1e/0xad
  Memory state around the buggy address:
   ffff8803867d7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffff8803867d7780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >ffff8803867d7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f4
                                                                  ^
   ffff8803867d7880: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
   ffff8803867d7900: 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f3 f3 f3 f3 00

KASAN instrumentation poisons the stack when entering a function and
unpoisons it when exiting the function.  However, in the suspend path,
some functions never return, so their stack never gets unpoisoned,
resulting in stale KASAN shadow data which can cause later false
positive warnings like the one above.

Reported-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-12-06 02:22:44 +01:00
Michal Hocko
bd041733c9 mm, vmscan: add cond_resched() into shrink_node_memcg()
Boris Zhmurov has reported RCU stalls during the kswapd reclaim:

  INFO: rcu_sched detected stalls on CPUs/tasks:
   23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
   (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
  Task dump for CPU 23:
  kswapd1         R  running task        0   148      2 0x00000008
  Call Trace:
    shrink_node+0xd2/0x2f0
    kswapd+0x2cb/0x6a0
    mem_cgroup_shrink_node+0x160/0x160
    kthread+0xbd/0xe0
    __switch_to+0x1fa/0x5c0
    ret_from_fork+0x1f/0x40
    kthread_create_on_node+0x180/0x180

a closer code inspection has shown that we might indeed miss all the
scheduling points in the reclaim path if no pages can be isolated from
the LRU list.  This is a pathological case but other reports from Donald
Buczek have shown that we might indeed hit such a path:

        clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
         kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
  [...]
         kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

this is minute long snapshot which didn't take a single page from the
LRU.  It is not entirely clear why only 1303 pages have been scanned
during that time (maybe there was a heavy IRQ activity interfering).

In any case it looks like we can really hit long periods without
scheduling on non preemptive kernels so an explicit cond_resched() in
shrink_node_memcg which is independent on the reclaim operation is due.

Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Boris Zhmurov <bb@kernelpanic.ru>
Tested-by: Boris Zhmurov <bb@kernelpanic.ru>
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Reported-by: "Christopher S. Aker" <caker@theshore.net>
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-02 18:48:03 -08:00
Michal Hocko
20ab67a563 mm: workingset: fix NULL ptr in count_shadow_nodes
Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
aware") has made the workingset shadow nodes shrinker memcg aware.  The
implementation is not correct though because memcg_kmem_enabled() might
become true while we are doing a global reclaim when the sc->memcg might
be NULL which is exactly what Marek has seen:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
  IP: [<ffffffff8122d520>] mem_cgroup_node_nr_lru_pages+0x20/0x40
  PGD 0
  Oops: 0000 [#1] SMP
  CPU: 0 PID: 60 Comm: kswapd0 Tainted: G           O   4.8.10-12.pvops.qubes.x86_64 #1
  task: ffff880011863b00 task.stack: ffff880011868000
  RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
  RSP: e02b:ffff88001186bc70  EFLAGS: 00010293
  RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
  RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
  R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
  R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
  CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
  Call Trace:
    count_shadow_nodes+0x9a/0xa0
    shrink_slab.part.42+0x119/0x3e0
    shrink_node+0x22c/0x320
    kswapd+0x32c/0x700
    kthread+0xd8/0xf0
    ret_from_fork+0x1f/0x40
  Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 <4a> 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
  RIP  mem_cgroup_node_nr_lru_pages+0x20/0x40
   RSP <ffff88001186bc70>
  CR2: 0000000000000400
  ---[ end trace 100494b9edbdfc4d ]---

This patch fixes the issue by checking sc->memcg rather than
memcg_kmem_enabled() which is sufficient because shrink_slab makes sure
that only memcg aware shrinkers will get non-NULL memcgs and only if
memcg_kmem_enabled is true.

Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@mimuw.edu.pl>
Tested-by: Marek Marczykowski-Górecki <marmarek@mimuw.edu.pl>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: <stable@vger.kernel.org>	[4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-02 18:48:03 -08:00
Anna-Maria Gleixner
e46b1db249 mm/compaction: Convert to hotplug state machine
Install the callbacks via the state machine. Should the hotplug init fail then
no threads are spawned.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161126231350.10321-15-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:37 +01:00
Sebastian Andrzej Siewior
cab7a7e5b6 mm/zswap: Convert pool to hotplug state machine
Install the callbacks via the state machine. Multi state is used to address the
per-pool notifier. Uppon adding of the intance the callback is invoked for all
online CPUs so the manual init can go.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-mm@kvack.org
Cc: Seth Jennings <sjenning@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161126231350.10321-13-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:36 +01:00
Sebastian Andrzej Siewior
ad7ed7708d mm/zswap: Convert dst-mem to hotplug state machine
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-mm@kvack.org
Cc: Seth Jennings <sjenning@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161126231350.10321-12-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:36 +01:00
Sebastian Andrzej Siewior
215c89d055 mm/zsmalloc: Convert to hotplug state machine
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-mm@kvack.org
Cc: Minchan Kim <minchan@kernel.org>
Cc: rt@linutronix.de
Cc: Nitin Gupta <ngupta@vflare.org>
Link: http://lkml.kernel.org/r/20161126231350.10321-11-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:36 +01:00
Sebastian Andrzej Siewior
5438da977f mm/vmstat: Convert to hotplug state machine
Install the callbacks via the state machine, but do not invoke them as we
can initialize the node state without calling the callbacks on all online
CPUs.

start_shepherd_timer() is now called outside the get_online_cpus() block
which is safe as it only operates on cpu possible mask.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161129145221.ffc3kg3hd7lxiwj6@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:35 +01:00
Sebastian Andrzej Siewior
4c501327b4 mm/vmstat: Avoid on each online CPU loops
Both iterations over online cpus can be replaced by the proper node
specific functions.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161129145113.fn3lw5aazjjvdrr3@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:35 +01:00
Sebastian Andrzej Siewior
76f290935b mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
Both functions are called with protection against cpu hotplug already so
*_online_cpus() could be dropped.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161126231350.10321-8-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:35 +01:00
Kirill A. Shutemov
5cbc198ae0 mm: fix false-positive WARN_ON() in truncate/invalidate for hugetlb
Hugetlb pages have ->index in size of the huge pages (PMD_SIZE or
PUD_SIZE), not in PAGE_SIZE as other types of pages.  This means we
cannot user page_to_pgoff() to check whether we've got the right page
for the radix-tree index.

Let's introduce page_to_index() which would return radix-tree index for
given page.

We will be able to get rid of this once hugetlb will be switched to
multi-order entries.

Fixes: fc127da085c2 ("truncate: handle file thp")
Link: http://lkml.kernel.org/r/20161123093053.mjbnvn5zwxw5e6lk@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Doug Nelson <doug.nelson@intel.com>
Tested-by: Doug Nelson <doug.nelson@intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30 16:32:52 -08:00
Dmitry Vyukov
828347f8f9 kasan: support use-after-scope detection
Gcc revision 241896 implements use-after-scope detection.  Will be
available in gcc 7.  Support it in KASAN.

Gcc emits 2 new callbacks to poison/unpoison large stack objects when
they go in/out of scope.  Implement the callbacks and add a test.

[dvyukov@google.com: v3]
  Link: http://lkml.kernel.org/r/1479998292-144502-1-git-send-email-dvyukov@google.com
Link: http://lkml.kernel.org/r/1479226045-145148-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30 16:32:52 -08:00
Dmitry Vyukov
045d599a28 kasan: update kasan_global for gcc 7
kasan_global struct is part of compiler/runtime ABI.  gcc revision
241983 has added a new field to kasan_global struct.  Update kernel
definition of kasan_global struct to include the new field.

Without this patch KASAN is broken with gcc 7.

Link: http://lkml.kernel.org/r/1479219743-28682-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30 16:32:52 -08:00
Kirill A. Shutemov
655548bf62 thp: fix corner case of munlock() of PTE-mapped THPs
The following program triggers BUG() in munlock_vma_pages_range():

	// autogenerated by syzkaller (http://github.com/google/syzkaller)
	#include <sys/mman.h>

	int main()
	{
	  mmap((void*)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0);
	  mremap((void*)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul, 0x203f0000ul);
	  return 0;
	}

The test-case constructs the situation when munlock_vma_pages_range()
finds PTE-mapped THP-head in the middle of page table and, by mistake,
skips HPAGE_PMD_NR pages after that.

As result, on the next iteration it hits the middle of PMD-mapped THP
and gets upset seeing mlocked tail page.

The solution is only skip HPAGE_PMD_NR pages if the THP was mlocked
during munlock_vma_page().  It would guarantee that the page is
PMD-mapped as we never mlock PTE-mapeed THPs.

Fixes: e90309c9f772 ("thp: allow mlocked THP again")
Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30 16:32:52 -08:00
Jérémy Lefaure
e1465d125d mm, thp: propagation of conditional compilation in khugepaged.c
Commit b46e756f5e47 ("thp: extract khugepaged from mm/huge_memory.c")
moved code from huge_memory.c to khugepaged.c.  Some of this code should
be compiled only when CONFIG_SYSFS is enabled but the condition around
this code was not moved into khugepaged.c.

The result is a compilation error when CONFIG_SYSFS is disabled:

  mm/built-in.o: In function `khugepaged_defrag_store': khugepaged.c:(.text+0x2d095): undefined reference to `single_hugepage_flag_store'
  mm/built-in.o: In function `khugepaged_defrag_show': khugepaged.c:(.text+0x2d0ab): undefined reference to `single_hugepage_flag_show'

This commit adds the #ifdef CONFIG_SYSFS around the code related to
sysfs.

Link: http://lkml.kernel.org/r/20161114203448.24197-1-jeremy.lefaure@lse.epita.fr
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30 16:32:52 -08:00
Aaron Lu
a2ce2666aa mremap: move_ptes: check pte dirty after its removal
Linus found there still is a race in mremap after commit 5d1904204c99
("mremap: fix race between mremap() and page cleanning").

As described by Linus:
 "the issue is that another thread might make the pte be dirty (in the
  hardware walker, so no locking of ours will make any difference)
  *after* we checked whether it was dirty, but *before* we removed it
  from the page tables"

Fix it by moving the check after we removed it from the page table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-29 08:20:24 -08:00
Eric W. Biederman
84d77d3f06 ptrace: Don't allow accessing an undumpable mm
It is the reasonable expectation that if an executable file is not
readable there will be no way for a user without special privileges to
read the file.  This is enforced in ptrace_attach but if ptrace
is already attached before exec there is no enforcement for read-only
executables.

As the only way to read such an mm is through access_process_vm
spin a variant called ptrace_access_vm that will fail if the
target process is not being ptraced by the current process, or
the current process did not have sufficient privileges when ptracing
began to read the target processes mm.

In the ptrace implementations replace access_process_vm by
ptrace_access_vm.  There remain several ptrace sites that still use
access_process_vm as they are reading the target executables
instructions (for kernel consumption) or register stacks.  As such it
does not appear necessary to add a permission check to those calls.

This bug has always existed in Linux.

Fixes: v1.0
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 12:57:38 -06:00
Eric W. Biederman
bfedb58925 mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.

The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue.  task->cred->user_ns is always the same
as or descendent of mm->user_ns.  Which guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
credentials.

To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm.  As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc/<pid>/stat

Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 11:49:48 -06:00
Aaron Lu
5d1904204c mremap: fix race between mremap() and page cleanning
Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost.  Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().

We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So, we think it's a real issue, but is difficult or impossible to hit in
practice.

The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  And
this patch is to fix the race between page_mkclean() and mremap().

Here is one possible way to hit the race: suppose a process mmapped a
file with READ | WRITE and SHARED, it has two threads and they are bound
to 2 different CPUs, e.g.  CPU1 and CPU2.  mmap returned X, then thread
1 did a write to addr X so that CPU1 now has a writable TLB for addr X
on it.  Thread 2 starts mremaping from addr X to Y while thread 1
cleaned the page and then did another write to the old addr X again.
The 2nd write from thread 1 could succeed but the value will get lost.

        thread 1                           thread 2
     (bound to CPU1)                    (bound to CPU2)

  1: write 1 to addr X to get a
     writeable TLB on this CPU

                                        2: mremap starts

                                        3: move_ptes emptied PTE for addr X
                                           and setup new PTE for addr Y and
                                           then dropped PTL for X and Y

  4: page laundering for N by doing
     fadvise FADV_DONTNEED. When done,
     pageframe N is deemed clean.

  5: *write 2 to addr X

                                        6: tlb flush for addr X

  7: munmap (Y, pagesize) to make the
     page unmapped

  8: fadvise with FADV_DONTNEED again
     to kick the page off the pagecache

  9: pread the page from file to verify
     the value. If 1 is there, it means
     we have lost the written 2.

  *the write may or may not cause segmentation fault, it depends on
  if the TLB is still on the CPU.

Please note that this is only one specific way of how the race could
occur, it didn't mean that the race could only occur in exact the above
config, e.g. more than 2 threads could be involved and fadvise() could
be done in another thread, etc.

For anonymous pages, they could race between mremap() and page reclaim:
THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
PMD gets unmapped/splitted/pagedout before the flush tlb happened for
the old huge PMD in move_page_tables() and we could still write data to
it.  The normal anonymous page has similar situation.

To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
if any, did the flush before dropping the PTL.  If we did the flush for
every move_ptes()/move_huge_pmd() call then we do not need to do the
flush in move_pages_tables() for the whole range.  But if we didn't, we
still need to do the whole range flush.

Alternatively, we can track which part of the range is flushed in
move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
range in move_page_tables().  But that would require multiple tlb
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.

KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
  real    5m14.048s
  user    32m19.800s
  sys     4m50.320s

With this commit:
  real    5m13.888s
  user    32m19.330s
  sys     4m51.200s

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-17 09:46:56 -08:00
Ingo Molnar
89a01c51cb Merge branch 'x86/cpufeature' into x86/asm, to pick up dependency
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-17 08:30:54 +01:00
Greg Kroah-Hartman
b7d91c9152 Merge 4.9-rc5 into char-misc-next
We want those fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-14 16:39:47 +01:00
Theodore Ts'o
a2f6d9c4c0 Merge branch 'dax-4.10-iomap-pmd' into origin 2016-11-13 22:02:15 -05:00
Jakub Kicinski
d7c19b066d mm: kmemleak: scan .data.ro_after_init
Limit the number of kmemleak false positives by including
.data.ro_after_init in memory scanning.  To achieve this we need to add
symbols for start and end of the section to the linker scripts.

The problem was been uncovered by commit 56989f6d8568 ("genetlink: mark
families as __ro_after_init").

Link: http://lkml.kernel.org/r/1478274173-15218-1-git-send-email-jakub.kicinski@netronome.com
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Greg Thelen
f773e36de3 memcg: prevent memcg caches to be both OFF_SLAB & OBJFREELIST_SLAB
While testing OBJFREELIST_SLAB integration with pagealloc, we found a
bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB &
CFLGS_OBJFREELIST_SLAB.  When it happened, critical allocations needed
for loading drivers or creating new caches will fail.

The original kmem_cache is created early making OFF_SLAB not possible.
When kmem_cache(sys) is created, OFF_SLAB is possible and if pagealloc
is enabled it will try to enable it first under certain conditions.
Given kmem_cache(sys) reuses the original flag, you can have both flags
at the same time resulting in allocation failures and odd behaviors.

This fix discards allocator specific flags from memcg before calling
create_cache.

The bug exists since 4.6-rc1 and affects testing debug pagealloc
configurations.

Fixes: b03a017bebc4 ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB")
Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Tested-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00