linux/fs/proc
Kirill A. Shutemov dc6c9a35b6 mm: account pmd page tables to the process
Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
kernel doesn't account PMD tables to the process, only PTE.

The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		prctl(PR_SET_THP_DISABLE);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';
			munmap(addr, PMD_SIZE);
			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
			if (addr == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}

The patch addresses the issue by account PMD tables to the process the
same way we account PTE.

The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:

 - HugeTLB can share PMD page tables. The patch handles by accounting
   the table to all processes who share it.

 - x86 PAE pre-allocates few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
counter value is used to calculate baseline for badness score by
oom-killer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:04 -08:00
..
array.c proc: task_state: ptrace_parent() doesn't need pid_alive() check 2014-12-10 17:41:09 -08:00
base.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
cmdline.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
consoles.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
cpuinfo.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
devices.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
fd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
fd.h proc: Move proc_fd() to fs/proc/fd.h 2013-05-01 17:29:39 -04:00
generic.c fs/proc.c: use rb_entry_safe() instead of rb_entry() 2014-12-10 17:41:09 -08:00
inode.c kill proc_ns completely 2014-12-10 21:30:57 -05:00
internal.h Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
interrupts.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
Kconfig kcore: add Kconfig help text 2013-11-13 12:09:33 +09:00
kcore.c fs/proc/kcore.c: don't add modules range to kcore if it's equal to vmcore range 2014-10-09 22:25:50 -04:00
kmsg.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
loadavg.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
Makefile proc: Implement /proc/thread-self to point at the directory of the current thread 2014-08-04 10:07:11 -07:00
meminfo.c fs/proc/meminfo.c: include cma info in proc/meminfo 2014-12-18 19:08:10 -08:00
namespaces.c kill proc_ns completely 2014-12-10 21:30:57 -05:00
nommu.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
page.c mm:add KPF_ZERO_PAGE flag for /proc/kpageflags 2015-02-11 17:06:00 -08:00
proc_net.c fs/proc: use a rb tree for the directory entries 2014-12-10 17:41:09 -08:00
proc_sysctl.c sysctl: remove typedef ctl_table 2014-08-08 15:57:24 -07:00
proc_tty.c proc: remove proc_tty_ldisc variable 2014-08-08 15:57:22 -07:00
root.c fs/proc: use a rb tree for the directory entries 2014-12-10 17:41:09 -08:00
self.c new helper: readlink_copy() 2014-04-01 23:19:15 -04:00
softirqs.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
stat.c genirq: Prevent proc race against freeing of irq descriptors 2014-12-13 13:33:07 +01:00
task_mmu.c mm: account pmd page tables to the process 2015-02-11 17:06:04 -08:00
task_nommu.c proc/maps: make vm_is_stack() logic namespace-friendly 2014-10-09 22:25:50 -04:00
thread_self.c proc: Implement /proc/thread-self to point at the directory of the current thread 2014-08-04 10:07:11 -07:00
uptime.c cputime: Default implementation of nsecs -> cputime conversion 2014-03-13 15:56:43 +01:00
version.c fs/proc: don't use module_init for non-modular core code 2014-01-23 16:37:02 -08:00
vmcore.c fs/proc/vmcore.c:mmap_vmcore: skip non-ram pages reported by hypervisors 2014-08-08 15:57:23 -07:00