2014-10-09 15:28:37 -07:00
/*
 * mm/debug.c
 *
 * mm/ specific debug routines.
 *
 */
2014-10-09 15:28:34 -07:00
#include <linux/kernel.h>
#include <linux/mm.h>
2015-04-29 14:36:05 -04:00
#include <linux/trace_events.h>
2014-10-09 15:28:34 -07:00
#include <linux/memcontrol.h>
mm, tracing: unify mm flags handling in tracepoints and printk
In tracepoints, it's possible to print gfp flags in a human-friendly
format through a macro show_gfp_flags(), which defines a translation
array and passes it to __print_flags(). Since the following patch will
introduce support for gfp flags printing in printk(), it would be nice
to reuse the array. This is not straightforward, since __print_flags()
can't simply reference an array defined in a .c file such as mm/debug.c
- it has to be a macro to allow the macro magic to communicate the
format to userspace tools such as trace-cmd.
The solution is to create a macro __def_gfpflag_names which is used both
in show_gfp_flags(), and to define the gfpflag_names[] array in
mm/debug.c.
On the other hand, mm/debug.c also defines translation tables for page
flags and vma flags, and a desire was expressed (but not implemented in
this series) to use these also from tracepoints. Thus, this patch also
renames the events/gfpflags.h file to events/mmflags.h and moves the
table definitions there, using the same macro approach as for gfpflags.
This allows translating all three kinds of mm-specific flags both in
tracepoints and printk.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
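The shared-macro approach described in this commit message can be sketched in plain userspace C. This is only an illustration under stated assumptions: the DEMO_* flag bits, struct demo_flag_name, DEMO_DEF_FLAG_NAMES and demo_local_table_len() are hypothetical names invented for the sketch and are not part of the kernel. One macro holds the table entries, and it is expanded both into a file-scope array (as mm/debug.c does) and into a table local to a call site (roughly what a show_gfp_flags()-style macro needs).

```c
#include <stddef.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_FLAG_WAIT	0x1UL
#define DEMO_FLAG_IO	0x2UL
#define DEMO_FLAG_FS	0x4UL

/* Hypothetical stand-in for the kernel's struct trace_print_flags. */
struct demo_flag_name {
	unsigned long mask;
	const char *name;
};

/* The entries live in exactly one place: this macro. */
#define DEMO_DEF_FLAG_NAMES		\
	{ DEMO_FLAG_WAIT, "WAIT" },	\
	{ DEMO_FLAG_IO, "IO" },		\
	{ DEMO_FLAG_FS, "FS" }

/* Use 1: a file-scope, NULL-terminated array. */
static const struct demo_flag_name demo_flag_names[] = {
	DEMO_DEF_FLAG_NAMES,
	{ 0, NULL }
};

/* Use 2: the same entries expanded into a table local to a call site. */
static size_t demo_local_table_len(void)
{
	static const struct demo_flag_name local[] = { DEMO_DEF_FLAG_NAMES };

	return sizeof(local) / sizeof(local[0]);
}
```

Because both tables expand from the same macro, adding a flag in one place updates both users, which is the point of the approach.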
2016-03-15 14:55:52 -07:00
#include <trace/events/mmflags.h>
2016-03-15 14:56:18 -07:00
#include <linux/migrate.h>
2016-03-15 14:56:21 -07:00
#include <linux/page_owner.h>
2014-10-09 15:28:34 -07:00
mm, printk: introduce new format string for flags
In mm we use several kinds of flags bitfields that are sometimes printed
for debugging purposes, or exported to userspace via sysfs. To make
them easier to interpret independently of kernel version and config, we
also want to dump the symbolic flag names. So far this has been done
with repeated calls to pr_cont(), which is unreliable on SMP, and not
usable for e.g. sysfs export.
To get a more reliable and universal solution, this patch extends
printk() format string for pointers to handle the page flags (%pGp),
gfp_flags (%pGg) and vma flags (%pGv). Existing users of
dump_flag_names() are converted and simplified.
It would be possible to pass flags by value instead of pointer, but the
%p format string for pointers already has extensions for various kernel
structures, so it's a good fit, and the extra indirection in a
non-critical path is negligible.
[linux@rasmusvillemoes.dk: lots of good implementation suggestions]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
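What a single-call flags formatter buys over repeated pr_cont() calls can be sketched in userspace C. This is a hedged illustration, not the kernel's implementation: struct demo_flag_name, demo_page_names and demo_format_flags() are hypothetical names, and the '|' separator and hex fallback for unnamed bits are choices made for this sketch. The function walks a NULL-terminated {mask, name} table and writes the whole result into one buffer, so the caller can emit it in a single print or return it from a sysfs-style handler.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct trace_print_flags. */
struct demo_flag_name {
	unsigned long mask;
	const char *name;
};

/* Hypothetical table; the bit values are illustrative only. */
static const struct demo_flag_name demo_page_names[] = {
	{ 0x1UL, "locked" },
	{ 0x2UL, "dirty" },
	{ 0x4UL, "lru" },
	{ 0, NULL }
};

/*
 * Write the names of all set bits into buf, separated by '|'. Bits with
 * no name in the table are appended as one hex value. Returns the number
 * of characters written, or -1 if buf is too small.
 */
static int demo_format_flags(char *buf, size_t size, unsigned long flags,
			     const struct demo_flag_name *names)
{
	size_t len = 0;
	int n;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	for (; names->name && flags; names++) {
		if ((flags & names->mask) != names->mask)
			continue;
		n = snprintf(buf + len, size - len, "%s%s",
			     len ? "|" : "", names->name);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
		flags &= ~names->mask;
	}
	if (flags) {
		n = snprintf(buf + len, size - len, "%s%#lx",
			     len ? "|" : "", flags);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}
	return (int)len;
}
```

A caller would pass a local buffer, for example demo_format_flags(buf, sizeof(buf), 0x5UL, demo_page_names), and then print buf in one call.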
2016-03-15 14:55:56 -07:00
#include "internal.h"
2016-03-15 14:56:18 -07:00
char *migrate_reason_names[MR_TYPES] = {
	"compaction",
	"memory_failure",
	"memory_hotplug",
	"syscall_or_cpuset",
	"mempolicy_mbind",
	"numa_misplaced",
	"cma",
};
2016-03-15 14:55:56 -07:00
const struct trace_print_flags pageflag_names[] = {
	__def_pageflag_names,
	{0, NULL}
};
const struct trace_print_flags gfpflag_names[] = {
	__def_gfpflag_names,
	{0, NULL}
2016-03-15 14:55:52 -07:00
};
2016-03-15 14:55:56 -07:00
const struct trace_print_flags vmaflag_names[] = {
	__def_vmaflag_names,
	{0, NULL}
2014-10-09 15:28:34 -07:00
};
2016-03-15 14:56:24 -07:00
void __dump_page(struct page *page, const char *reason)
2014-10-09 15:28:34 -07:00
{
2016-10-07 17:01:40 -07:00
	/*
	 * Avoid VM_BUG_ON() in page_mapcount().
	 * page->_mapcount space in struct page is used by sl[aou]b pages to
	 * encode own info.
	 */
2016-09-19 14:44:07 -07:00
	int mapcount = PageSlab(page) ? 0 : page_mapcount(page);
2016-01-15 16:53:42 -08:00
	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
2016-09-19 14:44:07 -07:00
		  page, page_ref_count(page), mapcount,
		  page->mapping, page_to_pgoff(page));
2016-01-15 16:53:42 -08:00
	if (PageCompound(page))
		pr_cont(" compound_mapcount: %d", compound_mapcount(page));
	pr_cont("\n");
2016-03-15 14:55:56 -07:00
	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1);
2016-03-15 14:56:24 -07:00
2016-03-15 14:55:59 -07:00
	pr_emerg("flags: %#lx(%pGp)\n", page->flags, &page->flags);
2016-12-12 16:44:35 -08:00
	print_hex_dump(KERN_ALERT, "raw: ", DUMP_PREFIX_NONE, 32,
			sizeof(unsigned long), page,
			sizeof(struct page), false);
2014-10-09 15:28:34 -07:00
	if (reason)
		pr_alert("page dumped because: %s\n", reason);
2016-03-15 14:55:59 -07:00
2014-12-10 15:44:58 -08:00
#ifdef CONFIG_MEMCG
	if (page->mem_cgroup)
		pr_alert("page->mem_cgroup:%p\n", page->mem_cgroup);
#endif
2014-10-09 15:28:34 -07:00
}
void dump_page(struct page *page, const char *reason)
{
2016-03-15 14:56:24 -07:00
	__dump_page(page, reason);
2016-03-15 14:56:21 -07:00
	dump_page_owner(page);
2014-10-09 15:28:34 -07:00
}
EXPORT_SYMBOL(dump_page);
#ifdef CONFIG_DEBUG_VM
void dump_vma(const struct vm_area_struct *vma)
{
2014-10-09 15:28:41 -07:00
	pr_emerg("vma %p start %p end %p\n"
2014-10-09 15:28:34 -07:00
		"next %p prev %p mm %p\n"
		"prot %lx anon_vma %p vm_ops %p\n"
2016-03-15 14:55:59 -07:00
		"pgoff %lx file %p private_data %p\n"
		"flags: %#lx(%pGv)\n",
2014-10-09 15:28:34 -07:00
		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_next,
		vma->vm_prev, vma->vm_mm,
		(unsigned long)pgprot_val(vma->vm_page_prot),
		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
2016-03-15 14:55:59 -07:00
		vma->vm_file, vma->vm_private_data,
		vma->vm_flags, &vma->vm_flags);
2014-10-09 15:28:34 -07:00
}
EXPORT_SYMBOL(dump_vma);
2014-10-09 15:28:37 -07:00
void dump_mm(const struct mm_struct *mm)
{
2014-10-09 15:28:41 -07:00
	pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n"
2014-10-09 15:28:37 -07:00
#ifdef CONFIG_MMU
		"get_unmapped_area %p\n"
#endif
		"mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
mm: account pmd page tables to the process
Dave noticed that an unprivileged process can allocate a significant amount
of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and
memory cgroup. The trick is to allocate a lot of PMD page tables. The Linux
kernel doesn't account PMD tables to the process, only PTE.
The test case below uses a few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low. oom_score for the process will be 0.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)
#define NR_PUD 130000
int main(void)
{
	char *addr = NULL;
	unsigned long i;
	/* Disable transparent huge pages for this process. */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
	for (i = 0; i < NR_PUD; i++) {
		addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
				MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			break;
		}
		*addr = 'x';
		munmap(addr, PMD_SIZE);
		/* Re-map the first PMD_SIZE bytes in place and check the result. */
		if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
				MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) == MAP_FAILED)
			perror("re-mmap"), exit(1);
	}
	printf("PID %d consumed %lu KiB in PMD page tables\n",
			getpid(), i * 4096 >> 10);
	return pause();
}
The patch addresses the issue by accounting PMD tables to the process the
same way we account PTE.
The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range(). But there are a few corner cases:
- HugeTLB can share PMD page tables. The patch handles this by accounting
the table to all processes that share it.
- x86 PAE pre-allocates a few PMD tables on fork.
- Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the sanity
check on exit(2).
Accounting only happens on configurations where the PMD page table level is
present (PMD is not folded). As with nr_ptes, we use a per-mm counter. The
counter value is used by the oom-killer to calculate the baseline for the
badness score.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 15:26:50 -08:00
		"pgd %p mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n"
2014-10-09 15:28:37 -07:00
		"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
2016-01-14 15:22:07 -08:00
		"pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n"
2014-10-09 15:28:37 -07:00
		"start_code %lx end_code %lx start_data %lx end_data %lx\n"
		"start_brk %lx brk %lx start_stack %lx\n"
		"arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
		"binfmt %p flags %lx core_state %p\n"
#ifdef CONFIG_AIO
		"ioctx_table %p\n"
#endif
#ifdef CONFIG_MEMCG
		"owner %p "
#endif
		"exe_file %p\n"
#ifdef CONFIG_MMU_NOTIFIER
		"mmu_notifier_mm %p\n"
#endif
#ifdef CONFIG_NUMA_BALANCING
		"numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n"
#endif
		"tlb_flush_pending %d\n"
2016-03-15 14:55:59 -07:00
		"def_flags: %#lx(%pGv)\n",
2014-10-09 15:28:37 -07:00
		mm, mm->mmap, mm->vmacache_seqnum, mm->task_size,
#ifdef CONFIG_MMU
		mm->get_unmapped_area,
#endif
		mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end,
		mm->pgd, atomic_read(&mm->mm_users),
		atomic_read(&mm->mm_count),
		atomic_long_read((atomic_long_t *)&mm->nr_ptes),
2015-02-11 15:26:50 -08:00
		mm_nr_pmds((struct mm_struct *)mm),
2014-10-09 15:28:37 -07:00
		mm->map_count,
		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
2016-01-14 15:22:07 -08:00
		mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
2014-10-09 15:28:37 -07:00
		mm->start_code, mm->end_code, mm->start_data, mm->end_data,
		mm->start_brk, mm->brk, mm->start_stack,
		mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
		mm->binfmt, mm->flags, mm->core_state,
#ifdef CONFIG_AIO
		mm->ioctx_table,
#endif
#ifdef CONFIG_MEMCG
		mm->owner,
#endif
		mm->exe_file,
#ifdef CONFIG_MMU_NOTIFIER
		mm->mmu_notifier_mm,
#endif
#ifdef CONFIG_NUMA_BALANCING
		mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
#endif
mm: migrate: prevent racy access to tlb_flush_pending
Patch series "fixes of TLB batching races", v6.
It turns out that the Linux TLB batching mechanism suffers from various
races. Races that are caused by batching during reclamation were
recently handled by Mel and this patch-set deals with others. The more
fundamental issue is that concurrent updates of the page-tables allow
for TLB flushes to be batched on one core, while another core changes
the page-tables. This other core may assume a PTE change does not
require a flush based on the updated PTE value, while it is unaware that
TLB flushes are still pending.
This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
Memory corruption in KSM is harder to produce in practice, but was
observed by hacking the kernel and adding a delay before flushing and
replacing the KSM page.
Finally, there is also one memory barrier missing, which may affect
architectures with a weak memory model.
This patch (of 7):
Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.
This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.
An actual data corruption was not observed, yet the race was
confirmed by adding an assertion to check that tlb_flush_pending is not set
by two threads, adding artificial latency in change_protection_range() and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:23:56 -07:00
		atomic_read(&mm->tlb_flush_pending),
2016-03-15 14:55:59 -07:00
		mm->def_flags, &mm->def_flags
	);
2014-10-09 15:28:37 -07:00
}
2014-10-09 15:28:34 -07:00
#endif /* CONFIG_DEBUG_VM */