2005-04-17 02:20:36 +04:00
/*
 *	mm/mremap.c
 *
 *	(C) Copyright 1996 Linus Torvalds
 *
2009-01-05 17:06:29 +03:00
 *	Address space accounting code	<alan@lxorguk.ukuu.org.uk>
2005-04-17 02:20:36 +04:00
 *	(C) Copyright 2002 Red Hat Inc, All Rights Reserved
 */
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/shm.h>
2009-09-22 04:02:05 +04:00
#include <linux/ksm.h>
2005-04-17 02:20:36 +04:00
#include <linux/mman.h>
#include <linux/swap.h>
2006-01-11 23:17:46 +03:00
#include <linux/capability.h>
2005-04-17 02:20:36 +04:00
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
mmu-notifiers: core
With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it spares the code that tears down the
mappings of the secondary MMU from implementing logic like tlb_gather in
zap_page_range, which would require many IPIs to flush other cpu tlbs for
each fixed number of sptes unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows mapping any page pointed to by any pte (and in turn visible in the
primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence the need to
schedule in XPMEM code to send the invalidate to the remote node, while
there is no need to schedule in kvm/gru as it's an immediate event like
invalidating a primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows compiling a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
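The counter described in dependency 1) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct secondary_mmu, range_begin(), range_end() and may_map_spte() are hypothetical names, not kernel API, and the sketch leaves out how a real driver would handle a range_begin that arrives between the check and the actual mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-mm state kept by a secondary-MMU user (not a kernel type). */
struct secondary_mmu {
	atomic_int invalidate_count;	/* > 0 while inside range_begin/end */
};

/* Called when the primary MMU starts invalidating a range. */
static void range_begin(struct secondary_mmu *s)
{
	atomic_fetch_add(&s->invalidate_count, 1);
}

/* Called when the primary MMU has finished invalidating the range. */
static void range_end(struct secondary_mmu *s)
{
	int prev = atomic_fetch_sub(&s->invalidate_count, 1);

	assert(prev > 0);	/* every range_end must pair with a range_begin */
	(void)prev;
}

/*
 * A secondary MMU page fault may only install an spte (or a secondary
 * tlb entry) while no invalidate critical section is in progress.
 */
static bool may_map_spte(struct secondary_mmu *s)
{
	return atomic_load(&s->invalidate_count) == 0;
}
```

Nested or concurrent invalidations simply stack: mapping stays forbidden until every range_begin has been matched by a range_end.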
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_register
is used when a driver starts up, a failure can be handled gracefully. Here is
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver starts up other allocations are required anyway and
-ENOMEM failure paths exist already.
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes whose absence would
prevent the kernel from compiling after these changes on non-x86 archs (x86
didn't need them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 02:46:29 +04:00
#include <linux/mmu_notifier.h>
2005-04-17 02:20:36 +04:00
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
2008-10-19 07:26:50 +04:00
#include "internal.h"
2005-10-30 04:16:00 +03:00
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
2005-04-17 02:20:36 +04:00
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pgd = pgd_offset(mm, addr);
	if (pgd_none_or_clear_bad(pgd))
		return NULL;
	pud = pud_offset(pgd, addr);
	if (pud_none_or_clear_bad(pud))
		return NULL;
	pmd = pmd_offset(pud, addr);
thp: mremap support and TLB optimization
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a proggy like this:
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (5UL*1024*1024*1024)
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
gettimeofday(&oldstamp, NULL);
p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
gettimeofday(&newstamp, NULL);
diffsec = newstamp.tv_sec - oldstamp.tv_sec;
diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
printf("usec %ld\n", diffsec);
if (p == MAP_FAILED || p4 != p3)
//if (p == MAP_FAILED)
perror("mremap"), exit(1);
if (memcmp(p4, p2, SIZE))
printf("mremap bug\n"), exit(1);
printf("ok\n");
return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPI on large SMP it'd
create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the
results.
The only limitation of a native 2M mremap like the one above is that both the
source and destination addresses must be 2M aligned, or the hugepmd can't be
moved without a split; but that is a hardware limitation.
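The alignment condition can be written down as a small predicate. This is a userspace sketch, not the kernel's move_huge_pmd(): the function name is hypothetical, and HUGE_PMD_SIZE assumes 2M huge pages as in the benchmark above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2M huge pages, as in the benchmark above. */
#define HUGE_PMD_SIZE	(2UL * 1024 * 1024)
#define HUGE_PMD_MASK	(~(HUGE_PMD_SIZE - 1))

/*
 * Hypothetical helper: true when a huge pmd could be moved as a whole,
 * i.e. source and destination are both 2M aligned and the chunk being
 * moved covers exactly one huge page.  Otherwise a split is needed.
 */
static bool can_move_whole_hugepmd(unsigned long old_addr,
				   unsigned long new_addr,
				   unsigned long extent)
{
	if (old_addr & ~HUGE_PMD_MASK)
		return false;	/* source not 2M aligned */
	if (new_addr & ~HUGE_PMD_MASK)
		return false;	/* destination not 2M aligned */
	return extent == HUGE_PMD_SIZE;
}
```

In the benchmark both buffers come from posix_memalign() with a 2M alignment, which is why no split shows up in /proc/vmstat.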
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-01 04:08:30 +04:00
	if (pmd_none(*pmd))
2005-04-17 02:20:36 +04:00
		return NULL;
2005-10-30 04:16:00 +03:00
	return pmd;
2005-04-17 02:20:36 +04:00
}
2011-01-14 02:46:43 +03:00
static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
			    unsigned long addr)
2005-04-17 02:20:36 +04:00
{
	pgd_t *pgd;
	pud_t *pud;
2005-10-30 04:16:23 +03:00
	pmd_t *pmd;
2005-04-17 02:20:36 +04:00
	pgd = pgd_offset(mm, addr);
	pud = pud_alloc(mm, pgd, addr);
	if (!pud)
2005-10-30 04:16:23 +03:00
		return NULL;
2005-10-30 04:16:00 +03:00
2005-04-17 02:20:36 +04:00
	pmd = pmd_alloc(mm, pud, addr);
2005-10-30 04:16:00 +03:00
	if (!pmd)
2005-10-30 04:16:23 +03:00
		return NULL;
2005-10-30 04:16:00 +03:00
2011-01-14 02:46:43 +03:00
	VM_BUG_ON(pmd_trans_huge(*pmd));
2005-10-30 04:16:23 +03:00
2005-10-30 04:16:00 +03:00
	return pmd;
2005-04-17 02:20:36 +04:00
}
2005-10-30 04:16:00 +03:00
static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
		unsigned long old_addr, unsigned long old_end,
		struct vm_area_struct *new_vma, pmd_t *new_pmd,
		unsigned long new_addr)
2005-04-17 02:20:36 +04:00
{
	struct address_space *mapping = NULL;
	struct mm_struct *mm = vma->vm_mm;
2005-10-30 04:16:00 +03:00
	pte_t *old_pte, *new_pte, pte;
[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 04:16:40 +03:00
	spinlock_t *old_ptl, *new_ptl;
2005-04-17 02:20:36 +04:00
	if (vma->vm_file) {
		/*
		 * Subtle point from Rajesh Venkatasubramanian: before
2009-08-20 20:35:05 +04:00
		 * moving file-based ptes, we must lock truncate_pagecache
		 * out, since it might clean the dst vma before the src vma,
2005-04-17 02:20:36 +04:00
		 * and we propagate stale pages into the dst afterward.
		 */
		mapping = vma->vm_file->f_mapping;
2011-05-25 04:12:06 +04:00
		mutex_lock(&mapping->i_mmap_mutex);
2005-04-17 02:20:36 +04:00
	}
2005-10-30 04:16:40 +03:00
	/*
	 * We don't have to worry about the ordering of src and dst
	 * pte locks because exclusive mmap_sem prevents deadlock.
	 */
2005-10-30 04:16:23 +03:00
	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
2010-10-27 01:21:52 +04:00
	new_pte = pte_offset_map(new_pmd, new_addr);
2005-10-30 04:16:40 +03:00
	new_ptl = pte_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
2006-07-03 11:25:08 +04:00
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
2006-10-01 10:29:33 +04:00
	arch_enter_lazy_mmu_mode();
2005-10-30 04:16:00 +03:00
	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;
2011-11-01 04:08:26 +04:00
		pte = ptep_get_and_clear(mm, old_addr, old_pte);
2005-10-30 04:16:00 +03:00
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		set_pte_at(mm, new_addr, new_pte, pte);
2005-04-17 02:20:36 +04:00
	}
2005-10-30 04:16:00 +03:00
2006-10-01 10:29:33 +04:00
	arch_leave_lazy_mmu_mode();
2005-10-30 04:16:40 +03:00
	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
2010-10-27 01:21:52 +04:00
	pte_unmap(new_pte - 1);
2005-10-30 04:16:23 +03:00
	pte_unmap_unlock(old_pte - 1, old_ptl);
2005-04-17 02:20:36 +04:00
	if (mapping)
2011-05-25 04:12:06 +04:00
		mutex_unlock(&mapping->i_mmap_mutex);
2005-04-17 02:20:36 +04:00
}
2005-10-30 04:16:00 +03:00
#define LATENCY_LIMIT	(64 * PAGE_SIZE)
2007-07-19 12:48:16 +04:00
unsigned long move_page_tables(struct vm_area_struct *vma,
2005-04-17 02:20:36 +04:00
		unsigned long old_addr, struct vm_area_struct *new_vma,
		unsigned long new_addr, unsigned long len)
{
2005-10-30 04:16:00 +03:00
	unsigned long extent, next, old_end;
	pmd_t *old_pmd, *new_pmd;
2011-11-01 04:08:26 +04:00
	bool need_flush = false;
2005-04-17 02:20:36 +04:00
2005-10-30 04:16:00 +03:00
	old_end = old_addr + len;
	flush_cache_range(vma, old_addr, old_end);
2005-04-17 02:20:36 +04:00
2011-11-01 04:08:26 +04:00
	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
2005-10-30 04:16:00 +03:00
	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
2005-04-17 02:20:36 +04:00
		cond_resched();
2005-10-30 04:16:00 +03:00
		next = (old_addr + PMD_SIZE) & PMD_MASK;
2011-11-01 04:08:22 +04:00
		/* even if next overflowed, extent below will be ok */
2005-10-30 04:16:00 +03:00
		extent = next - old_addr;
2011-11-01 04:08:22 +04:00
		if (extent > old_end - old_addr)
			extent = old_end - old_addr;
2005-10-30 04:16:00 +03:00
		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
		if (!old_pmd)
			continue;
2011-01-14 02:46:43 +03:00
		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
2005-10-30 04:16:00 +03:00
		if (!new_pmd)
			break;
2011-11-01 04:08:30 +04:00
		if (pmd_trans_huge(*old_pmd)) {
			int err = 0;
			if (extent == HPAGE_PMD_SIZE)
				err = move_huge_pmd(vma, new_vma, old_addr,
						    new_addr, old_end,
						    old_pmd, new_pmd);
			if (err > 0) {
				need_flush = true;
				continue;
			} else if (!err) {
				split_huge_page_pmd(vma->vm_mm, old_pmd);
			}
			VM_BUG_ON(pmd_trans_huge(*old_pmd));
		}
		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
						      new_pmd, new_addr))
			break;
2005-10-30 04:16:00 +03:00
		next = (new_addr + PMD_SIZE) & PMD_MASK;
		if (extent > next - new_addr)
			extent = next - new_addr;
		if (extent > LATENCY_LIMIT)
			extent = LATENCY_LIMIT;
		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
				new_vma, new_pmd, new_addr);
2011-11-01 04:08:26 +04:00
		need_flush = true;
2005-04-17 02:20:36 +04:00
	}
2011-11-01 04:08:26 +04:00
	if (likely(need_flush))
		flush_tlb_range(vma, old_end - len, old_addr);
	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end - len, old_end);
2005-10-30 04:16:00 +03:00
	return len + old_addr - old_end;	/* how much done */
2005-04-17 02:20:36 +04:00
}
static unsigned long move_vma(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long old_len,
		unsigned long new_len, unsigned long new_addr)
{
	struct mm_struct *mm = vma->vm_mm;
	struct vm_area_struct *new_vma;
	unsigned long vm_flags = vma->vm_flags;
	unsigned long new_pgoff;
	unsigned long moved_len;
	unsigned long excess = 0;
[PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
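The "update just in time" convention can be sketched in userspace C as follows. The struct and the helper names grow_vm(), shrink_vm() and peak_vm() are hypothetical stand-ins for illustration; only update_hiwater_vm is a name taken from the text above, and here it is a plain function rather than the macro the patch introduces.

```c
#include <assert.h>

/* Hypothetical stand-in for the two mm fields discussed above. */
struct mm_counters {
	unsigned long total_vm;
	unsigned long hiwater_vm;
};

/* Record the current value as the high-water mark if it is a new maximum. */
static void update_hiwater_vm(struct mm_counters *mm)
{
	if (mm->hiwater_vm < mm->total_vm)
		mm->hiwater_vm = mm->total_vm;
}

/* Hot path: raising the counter does not touch the high-water mark. */
static void grow_vm(struct mm_counters *mm, unsigned long pages)
{
	mm->total_vm += pages;
}

/* Rare path: capture the peak just before the counter is lowered. */
static void shrink_vm(struct mm_counters *mm, unsigned long pages)
{
	assert(pages <= mm->total_vm);
	update_hiwater_vm(mm);
	mm->total_vm -= pages;
}

/* The collector does the remaining work: take the maximum of both fields. */
static unsigned long peak_vm(const struct mm_counters *mm)
{
	return mm->hiwater_vm > mm->total_vm ? mm->hiwater_vm : mm->total_vm;
}
```

Between updates hiwater_vm may lag behind total_vm, which is why a reader must always go through something like peak_vm() rather than trust hiwater_vm alone.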
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 04:16:18 +03:00
	unsigned long hiwater_vm;
2005-04-17 02:20:36 +04:00
	int split = 0;
2009-09-22 04:02:28 +04:00
	int err;
2005-04-17 02:20:36 +04:00
	/*
	 * We'd prefer to avoid failure later on in do_munmap:
	 * which may split one vma into three before unmapping.
	 */
	if (mm->map_count >= sysctl_max_map_count - 3)
		return -ENOMEM;
2009-09-22 04:02:05 +04:00
	/*
	 * Advise KSM to break any KSM pages in the area to be moved:
	 * it would be confusing if they were to turn up at the new
	 * location, where they happen to coincide with different KSM
	 * pages recently unmapped.  But leave vma->vm_flags as it was,
	 * so KSM can come around to merge on vma and new_vma afterwards.
	 */
2009-09-22 04:02:28 +04:00
	err = ksm_madvise(vma, old_addr, old_addr + old_len,
						MADV_UNMERGEABLE, &vm_flags);
	if (err)
		return err;
2009-09-22 04:02:05 +04:00
2005-04-17 02:20:36 +04:00
	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff);
	if (!new_vma)
		return -ENOMEM;
	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
	if (moved_len < old_len) {
mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.
If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.
Update: Hugh noticed that special care is needed in the error path, where
move_page_tables goes in the reverse direction: a second
anon_vma_moveto_tail() call is needed there.
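The ordering requirement can be illustrated with a small userspace list. This is a sketch only: struct node and the helpers below are hypothetical and merely imitate the style of the kernel's circular lists; they are not the same_anon_vma list code. The point is that a walker visiting entries in list order must reach the source before the destination, so the destination is forced to the tail before the move starts.

```c
#include <assert.h>

/* Hypothetical circular doubly linked list with a sentinel head. */
struct node {
	struct node *prev, *next;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink n from wherever it is and queue it again as the last entry. */
static void list_move_tail(struct node *n, struct node *head)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_add_tail(n, head);
}

/* Position of n in walk order (0 = visited first), or -1 if not on the list. */
static int walk_position(const struct node *head, const struct node *n)
{
	int pos = 0;

	for (const struct node *it = head->next; it != head; it = it->next, pos++)
		if (it == n)
			return pos;
	return -1;
}
```

If the destination happens to sit before the source, as can occur after a successful vma_merge, one list_move_tail() on the destination restores the order in which a walker sees the source first.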
This program exercises the anon_vma_moveto_tail:
===
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
/* SIZE is not defined in this excerpt; the value below is an assumption. */
#define SIZE (64UL*1024*1024)
int main()
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;
	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	memset(p, 0xff, SIZE);
	printf("%p\n", p);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	/* Move the upper half of p onto p3, then move it back again. */
	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	if (p4 != p3)
		perror("mremap"), exit(1);
	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
	if (p4 != p+SIZE/2)
		perror("mremap"), exit(1);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	printf("ok\n");
	return 0;
}
===
$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
You can now use it on all perf tools, such as:
perf record -e probe:anon_vma_moveto_tail -aR sleep 1
$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:08:05 +04:00
/*
* Before moving the page tables from the new vma to
* the old vma , we need to be sure the old vma is
* queued after new vma in the same_anon_vma list to
* prevent SMP races with rmap_walk ( that could lead
* rmap_walk to miss some page table ) .
*/
anon_vma_moveto_tail ( vma ) ;
2005-04-17 02:20:36 +04:00
/*
* On error , move entries back from new area to old ,
* which will succeed since page tables still there ,
* and then proceed to unmap new area instead of old .
*/
move_page_tables ( new_vma , new_addr , vma , old_addr , moved_len ) ;
vma = new_vma ;
old_len = new_len ;
old_addr = new_addr ;
new_addr = - ENOMEM ;
}
/* Conceal VM_ACCOUNT so old reservation is not undone */
if ( vm_flags & VM_ACCOUNT ) {
vma - > vm_flags & = ~ VM_ACCOUNT ;
excess = vma - > vm_end - vma - > vm_start - old_len ;
if ( old_addr > vma - > vm_start & &
old_addr + old_len < vma - > vm_end )
split = 1 ;
}
2005-05-17 08:53:18 +04:00
/*
[PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 04:16:18 +03:00
* If we failed to move page tables we still do total_vm increment
* since do_munmap ( ) will decrement it by old_len = = new_len .
*
* Since total_vm is about to be raised artificially high for a
* moment , we need to restore high watermark afterwards : if stats
* are taken meanwhile , total_vm and hiwater_vm appear too high .
* If this were a serious issue , we ' d add a flag to do_munmap ( ) .
2005-05-17 08:53:18 +04:00
*/
2005-10-30 04:16:18 +03:00
hiwater_vm = mm - > hiwater_vm ;
2005-05-17 08:53:18 +04:00
mm - > total_vm + = new_len > > PAGE_SHIFT ;
2005-10-30 04:15:56 +03:00
vm_stat_account ( mm , vma - > vm_flags , vma - > vm_file , new_len > > PAGE_SHIFT ) ;
2005-05-17 08:53:18 +04:00
2005-04-17 02:20:36 +04:00
if ( do_munmap ( mm , old_addr , old_len ) < 0 ) {
/* OOM: unable to split vma, just get accounts right */
vm_unacct_memory ( excess > > PAGE_SHIFT ) ;
excess = 0 ;
}
2005-10-30 04:16:18 +03:00
mm - > hiwater_vm = hiwater_vm ;
2005-04-17 02:20:36 +04:00
/* Restore VM_ACCOUNT if one or two pieces of vma left */
if ( excess ) {
vma - > vm_flags | = VM_ACCOUNT ;
if ( split )
vma - > vm_next - > vm_flags | = VM_ACCOUNT ;
}
if ( vm_flags & VM_LOCKED ) {
mm - > locked_vm + = new_len > > PAGE_SHIFT ;
if ( new_len > old_len )
2008-10-19 07:26:50 +04:00
mlock_vma_pages_range ( new_vma , new_addr + old_len ,
new_addr + new_len ) ;
2005-04-17 02:20:36 +04:00
}
return new_addr ;
}
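The high-watermark handling in move_vma() above can be modelled in user space. What follows is a minimal sketch, not kernel code: struct mm_model, model_munmap(), model_move() and peak_vm() are hypothetical names, and model_munmap() is assumed to capture the watermark just before lowering total_vm, following the "update only when about to lower" rule from the update_hiwaters commit message.

```c
#include <assert.h>

/* Hypothetical user-space stand-in for the two mm_struct counters. */
struct mm_model {
	unsigned long total_vm;   /* current size, in pages */
	unsigned long hiwater_vm; /* recorded peak, in pages */
};

/* Raise the recorded peak if the current size exceeds it. */
static void update_hiwater_vm(struct mm_model *mm)
{
	if (mm->hiwater_vm < mm->total_vm)
		mm->hiwater_vm = mm->total_vm;
}

/* Assumed behaviour of the unmap path: capture the peak, then shrink. */
static void model_munmap(struct mm_model *mm, unsigned long pages)
{
	assert(pages <= mm->total_vm);
	update_hiwater_vm(mm);
	mm->total_vm -= pages;
}

/* Mirrors the save / inflate / unmap / restore sequence in move_vma(). */
static void model_move(struct mm_model *mm, unsigned long old_pages,
		       unsigned long new_pages)
{
	unsigned long hiwater_vm = mm->hiwater_vm;

	mm->total_vm += new_pages;	/* artificially high for a moment */
	model_munmap(mm, old_pages);	/* would record the inflated value */
	mm->hiwater_vm = hiwater_vm;	/* undo the bogus peak */
}

/* Collectors take the maximum themselves, as the commit message demands. */
static unsigned long peak_vm(const struct mm_model *mm)
{
	return mm->hiwater_vm > mm->total_vm ? mm->hiwater_vm : mm->total_vm;
}
```

Without the final restore in model_move(), the recorded peak would keep the transient value total_vm + new_pages, which is the anomaly the commit message says used to stick.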
2009-11-24 15:17:46 +03:00
static struct vm_area_struct * vma_to_resize ( unsigned long addr ,
unsigned long old_len , unsigned long new_len , unsigned long * p )
{
struct mm_struct * mm = current - > mm ;
struct vm_area_struct * vma = find_vma ( mm , addr ) ;
if ( ! vma | | vma - > vm_start > addr )
goto Efault ;
if ( is_vm_hugetlb_page ( vma ) )
goto Einval ;
/* We can't remap across vm area boundaries */
if ( old_len > vma - > vm_end - addr )
goto Efault ;
2011-04-07 18:35:50 +04:00
/* Need to be careful about a growing mapping */
if ( new_len > old_len ) {
unsigned long pgoff ;
if ( vma - > vm_flags & ( VM_DONTEXPAND | VM_PFNMAP ) )
2009-11-24 15:17:46 +03:00
goto Efault ;
2011-04-07 18:35:50 +04:00
pgoff = ( addr - vma - > vm_start ) > > PAGE_SHIFT ;
pgoff + = vma - > vm_pgoff ;
if ( pgoff + ( new_len > > PAGE_SHIFT ) < pgoff )
goto Einval ;
2009-11-24 15:17:46 +03:00
}
if ( vma - > vm_flags & VM_LOCKED ) {
unsigned long locked , lock_limit ;
locked = mm - > locked_vm < < PAGE_SHIFT ;
2010-03-06 00:41:44 +03:00
lock_limit = rlimit ( RLIMIT_MEMLOCK ) ;
2009-11-24 15:17:46 +03:00
locked + = new_len - old_len ;
if ( locked > lock_limit & & ! capable ( CAP_IPC_LOCK ) )
goto Eagain ;
}
if ( ! may_expand_vm ( mm , ( new_len - old_len ) > > PAGE_SHIFT ) )
goto Enomem ;
if ( vma - > vm_flags & VM_ACCOUNT ) {
unsigned long charged = ( new_len - old_len ) > > PAGE_SHIFT ;
2012-02-13 07:58:52 +04:00
if ( security_vm_enough_memory_mm ( mm , charged ) )
2009-11-24 15:17:46 +03:00
goto Efault ;
* p = charged ;
}
return vma ;
Efault : /* very odd choice for most of the cases, but... */
return ERR_PTR ( - EFAULT ) ;
Einval :
return ERR_PTR ( - EINVAL ) ;
Enomem :
return ERR_PTR ( - ENOMEM ) ;
Eagain :
return ERR_PTR ( - EAGAIN ) ;
}
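The page-offset wraparound test in vma_to_resize() relies on unsigned arithmetic: if adding the new length in pages makes the sum smaller than the starting offset, the addition wrapped. A small user-space sketch of that check follows; pgoff_would_wrap() and DEMO_PAGE_SHIFT are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Returns 1 if growing to new_len bytes would wrap the page offset,
 * mirroring "pgoff + (new_len >> PAGE_SHIFT) < pgoff" in vma_to_resize().
 */
static int pgoff_would_wrap(unsigned long vm_start, unsigned long vm_pgoff,
			    unsigned long addr, unsigned long new_len)
{
	unsigned long pgoff;

	assert(addr >= vm_start);
	pgoff = (addr - vm_start) >> DEMO_PAGE_SHIFT;
	pgoff += vm_pgoff;
	/* Unsigned overflow is well defined in C: a wrapped sum is smaller. */
	return pgoff + (new_len >> DEMO_PAGE_SHIFT) < pgoff;
}
```

For example, with vm_pgoff two pages below the maximum unsigned long, a four-page new_len wraps and the function returns 1; the kernel code answers that case with -EINVAL.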
2009-11-24 15:28:07 +03:00
static unsigned long mremap_to ( unsigned long addr ,
unsigned long old_len , unsigned long new_addr ,
unsigned long new_len )
{
struct mm_struct * mm = current - > mm ;
struct vm_area_struct * vma ;
unsigned long ret = - EINVAL ;
unsigned long charged = 0 ;
2009-11-24 16:43:52 +03:00
unsigned long map_flags ;
2009-11-24 15:28:07 +03:00
if ( new_addr & ~ PAGE_MASK )
goto out ;
if ( new_len > TASK_SIZE | | new_addr > TASK_SIZE - new_len )
goto out ;
/* Check if the location we're moving into overlaps the
* old location at all , and fail if it does .
*/
if ( ( new_addr < = addr ) & & ( new_addr + new_len ) > addr )
goto out ;
if ( ( addr < = new_addr ) & & ( addr + old_len ) > new_addr )
goto out ;
ret = security_file_mmap ( NULL , 0 , 0 , 0 , new_addr , 1 ) ;
if ( ret )
goto out ;
ret = do_munmap ( mm , new_addr , new_len ) ;
if ( ret )
goto out ;
if ( old_len > = new_len ) {
ret = do_munmap ( mm , addr + new_len , old_len - new_len ) ;
if ( ret & & old_len ! = new_len )
goto out ;
old_len = new_len ;
}
vma = vma_to_resize ( addr , old_len , new_len , & charged ) ;
if ( IS_ERR ( vma ) ) {
ret = PTR_ERR ( vma ) ;
goto out ;
}
2009-11-24 16:43:52 +03:00
map_flags = MAP_FIXED ;
if ( vma - > vm_flags & VM_MAYSHARE )
map_flags | = MAP_SHARED ;
2009-12-03 23:23:11 +03:00
2009-11-24 16:43:52 +03:00
ret = get_unmapped_area ( vma - > vm_file , new_addr , new_len , vma - > vm_pgoff +
( ( addr - vma - > vm_start ) > > PAGE_SHIFT ) ,
map_flags ) ;
2009-11-24 15:28:07 +03:00
if ( ret & ~ PAGE_MASK )
2009-11-24 16:43:52 +03:00
goto out1 ;
ret = move_vma ( vma , addr , old_len , new_len , new_addr ) ;
if ( ! ( ret & ~ PAGE_MASK ) )
goto out ;
out1 :
vm_unacct_memory ( charged ) ;
2009-11-24 15:28:07 +03:00
out :
return ret ;
}
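The two overlap conditions at the top of mremap_to() can be read as one predicate on half-open ranges. The sketch below restates them in user space; ranges_overlap() is a hypothetical helper, and it assumes that neither range wraps around the address space (the kernel rules that out separately with its TASK_SIZE checks).

```c
#include <assert.h>

/*
 * Returns 1 if [new_addr, new_addr + new_len) overlaps
 * [addr, addr + old_len), using the same two tests as mremap_to().
 */
static int ranges_overlap(unsigned long addr, unsigned long old_len,
			  unsigned long new_addr, unsigned long new_len)
{
	/* The sketch assumes neither range wraps around the address space. */
	assert(addr + old_len >= addr && new_addr + new_len >= new_addr);

	/* Destination starts at or below the source and reaches into it. */
	if (new_addr <= addr && new_addr + new_len > addr)
		return 1;
	/* Source starts at or below the destination and reaches into it. */
	if (addr <= new_addr && addr + old_len > new_addr)
		return 1;
	return 0;
}
```

Ranges that merely touch, such as a source ending exactly where the destination begins, are not treated as overlapping.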
2009-11-24 15:43:18 +03:00
static int vma_expandable ( struct vm_area_struct * vma , unsigned long delta )
{
2009-11-24 16:25:18 +03:00
unsigned long end = vma - > vm_end + delta ;
2009-12-03 23:23:11 +03:00
if ( end < vma - > vm_end ) /* overflow */
2009-11-24 16:25:18 +03:00
return 0 ;
2009-12-03 23:23:11 +03:00
if ( vma - > vm_next & & vma - > vm_next - > vm_start < end ) /* intersection */
2009-11-24 16:25:18 +03:00
return 0 ;
if ( get_unmapped_area ( NULL , vma - > vm_start , end - vma - > vm_start ,
0 , MAP_FIXED ) & ~ PAGE_MASK )
2009-11-24 15:43:18 +03:00
return 0 ;
return 1 ;
}
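The first two tests in vma_expandable() are plain arithmetic and can be tried outside the kernel. The following sketch uses a hypothetical can_expand_in_place() helper in which next_start == 0 stands for "no next mapping"; the final get_unmapped_area() check is deliberately not modelled.

```c
#include <assert.h>

/*
 * Returns 1 if a mapping ending at vm_end can grow by delta bytes without
 * wrapping the address space or running into the next mapping.
 * next_start == 0 means "no next mapping" in this sketch.
 */
static int can_expand_in_place(unsigned long vm_end, unsigned long delta,
			       unsigned long next_start)
{
	unsigned long end = vm_end + delta;

	assert(next_start == 0 || next_start >= vm_end);
	if (end < vm_end)			/* overflow */
		return 0;
	if (next_start && next_start < end)	/* intersection */
		return 0;
	return 1;
}
```

Growing right up to the start of the next mapping is allowed, because the comparison is strict.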
2005-04-17 02:20:36 +04:00
/*
* Expand ( or shrink ) an existing mapping , potentially moving it at the
* same time ( controlled by the MREMAP_MAYMOVE flag and available VM space )
*
* MREMAP_FIXED option added 5 - Dec - 1999 by Benjamin LaHaise
* This option implies MREMAP_MAYMOVE .
*/
unsigned long do_mremap ( unsigned long addr ,
unsigned long old_len , unsigned long new_len ,
unsigned long flags , unsigned long new_addr )
{
2005-10-30 04:16:16 +03:00
struct mm_struct * mm = current - > mm ;
2005-04-17 02:20:36 +04:00
struct vm_area_struct * vma ;
unsigned long ret = - EINVAL ;
unsigned long charged = 0 ;
if ( flags & ~ ( MREMAP_FIXED | MREMAP_MAYMOVE ) )
goto out ;
if ( addr & ~ PAGE_MASK )
goto out ;
old_len = PAGE_ALIGN ( old_len ) ;
new_len = PAGE_ALIGN ( new_len ) ;
/*
* We allow a zero old - len as a special case
* for DOS - emu " duplicate shm area " thing . But
* a zero new - len is nonsensical .
*/
if ( ! new_len )
goto out ;
if ( flags & MREMAP_FIXED ) {
2009-11-24 15:28:07 +03:00
if ( flags & MREMAP_MAYMOVE )
ret = mremap_to ( addr , old_len , new_addr , new_len ) ;
goto out ;
2005-04-17 02:20:36 +04:00
}
/*
* Always allow a shrinking remap : that just unmaps
* the unnecessary pages . .
* do_munmap does all the needed commit accounting
*/
if ( old_len > = new_len ) {
2005-10-30 04:16:16 +03:00
ret = do_munmap ( mm , addr + new_len , old_len - new_len ) ;
2005-04-17 02:20:36 +04:00
if ( ret & & old_len ! = new_len )
goto out ;
ret = addr ;
2009-11-24 15:28:07 +03:00
goto out ;
2005-04-17 02:20:36 +04:00
}
/*
2009-11-24 15:28:07 +03:00
* Ok , we need to grow . .
2005-04-17 02:20:36 +04:00
*/
2009-11-24 15:17:46 +03:00
vma = vma_to_resize ( addr , old_len , new_len , & charged ) ;
if ( IS_ERR ( vma ) ) {
ret = PTR_ERR ( vma ) ;
2005-04-17 02:20:36 +04:00
goto out ;
2005-05-01 19:58:35 +04:00
}
2005-04-17 02:20:36 +04:00
/* old_len exactly to the end of the area..
*/
2009-11-24 15:28:07 +03:00
if ( old_len = = vma - > vm_end - addr ) {
2005-04-17 02:20:36 +04:00
/* can we just expand the current mapping? */
2009-11-24 15:43:18 +03:00
if ( vma_expandable ( vma , new_len - old_len ) ) {
2005-04-17 02:20:36 +04:00
int pages = ( new_len - old_len ) > > PAGE_SHIFT ;
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach into the tens of
thousands. Real workloads are still a factor of 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
if ( vma_adjust ( vma , vma - > vm_start , addr + new_len ,
vma - > vm_pgoff , NULL ) ) {
ret = - ENOMEM ;
goto out ;
}
2005-04-17 02:20:36 +04:00
2005-10-30 04:16:16 +03:00
mm - > total_vm + = pages ;
vm_stat_account ( mm , vma - > vm_flags , vma - > vm_file , pages ) ;
2005-04-17 02:20:36 +04:00
if ( vma - > vm_flags & VM_LOCKED ) {
2005-10-30 04:16:16 +03:00
mm - > locked_vm + = pages ;
2008-10-19 07:26:50 +04:00
mlock_vma_pages_range ( vma , addr + old_len ,
2005-04-17 02:20:36 +04:00
addr + new_len ) ;
}
ret = addr ;
goto out ;
}
}
/*
* We weren ' t able to just expand or shrink the area ,
* we need to create a new one and move it . .
*/
ret = - ENOMEM ;
if ( flags & MREMAP_MAYMOVE ) {
2009-11-24 15:28:07 +03:00
unsigned long map_flags = 0 ;
if ( vma - > vm_flags & VM_MAYSHARE )
map_flags | = MAP_SHARED ;
new_addr = get_unmapped_area ( vma - > vm_file , 0 , new_len ,
2009-11-24 16:45:24 +03:00
vma - > vm_pgoff +
( ( addr - vma - > vm_start ) > > PAGE_SHIFT ) ,
map_flags ) ;
2009-11-24 15:28:07 +03:00
if ( new_addr & ~ PAGE_MASK ) {
ret = new_addr ;
goto out ;
2005-04-17 02:20:36 +04:00
}
2009-11-24 15:28:07 +03:00
ret = security_file_mmap ( NULL , 0 , 0 , 0 , new_addr , 1 ) ;
if ( ret )
goto out ;
2005-04-17 02:20:36 +04:00
ret = move_vma ( vma , addr , old_len , new_len , new_addr ) ;
}
out :
if ( ret & ~ PAGE_MASK )
vm_unacct_memory ( charged ) ;
return ret ;
}
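The early argument checks in do_mremap() can be summarised as a small validation routine. This is a user-space sketch under stated assumptions: check_mremap_args() and every DEMO_ macro are hypothetical stand-ins (the flag values match the Linux ABI values of MREMAP_MAYMOVE and MREMAP_FIXED, and a 4 KiB page size is assumed), and only the checks that precede any VMA lookup are modelled.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_PAGE_SIZE		4096UL	/* assumption: 4 KiB pages */
#define DEMO_PAGE_MASK		(~(DEMO_PAGE_SIZE - 1))
#define DEMO_PAGE_ALIGN(x)	(((x) + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK)
#define DEMO_MREMAP_MAYMOVE	1UL
#define DEMO_MREMAP_FIXED	2UL

/*
 * Returns 0 and page-aligns both lengths if the arguments pass the early
 * checks, or -EINVAL otherwise.
 */
static long check_mremap_args(unsigned long addr, unsigned long *old_len,
			      unsigned long *new_len, unsigned long flags)
{
	assert(old_len && new_len);

	if (flags & ~(DEMO_MREMAP_FIXED | DEMO_MREMAP_MAYMOVE))
		return -EINVAL;		/* unknown flag bits */
	if (addr & ~DEMO_PAGE_MASK)
		return -EINVAL;		/* addr must be page aligned */
	*old_len = DEMO_PAGE_ALIGN(*old_len);
	*new_len = DEMO_PAGE_ALIGN(*new_len);
	if (!*new_len)
		return -EINVAL;		/* zero old_len is allowed, zero new_len is not */
	/* MREMAP_FIXED without MREMAP_MAYMOVE falls through to -EINVAL. */
	if ((flags & DEMO_MREMAP_FIXED) && !(flags & DEMO_MREMAP_MAYMOVE))
		return -EINVAL;
	return 0;
}
```

The lengths are rounded up rather than rejected, so a caller passing new_len = 5000 is treated as asking for two pages.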
2009-01-14 16:14:15 +03:00
SYSCALL_DEFINE5 ( mremap , unsigned long , addr , unsigned long , old_len ,
unsigned long , new_len , unsigned long , flags ,
unsigned long , new_addr )
2005-04-17 02:20:36 +04:00
{
unsigned long ret ;
down_write ( & current - > mm - > mmap_sem ) ;
ret = do_mremap ( addr , old_len , new_len , flags , new_addr ) ;
up_write ( & current - > mm - > mmap_sem ) ;
return ret ;
}
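From user space, the syscall defined above is reached through the mremap() wrapper. The sketch below grows a one-page anonymous mapping to two pages with MREMAP_MAYMOVE and checks that the old contents survive; grow_mapping_demo() is a hypothetical name, and the code assumes Linux with glibc, where _GNU_SOURCE must be defined before the first include for mremap() to be declared.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if the grown mapping kept its data, -1 on any failure. */
static int grow_mapping_demo(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t page;
	char *p, *q;
	int ok;

	assert(psz > 0);
	page = (size_t)psz;

	p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0x5a, page);

	/* The kernel may expand in place or move the mapping elsewhere. */
	q = mremap(p, page, 2 * page, MREMAP_MAYMOVE);
	if (q == MAP_FAILED) {
		munmap(p, page);
		return -1;
	}

	/* Old page keeps its contents; the new anonymous page reads as zero. */
	ok = q[0] == 0x5a && q[page - 1] == 0x5a && q[page] == 0;
	munmap(q, 2 * page);
	return ok ? 0 : -1;
}
```

Whether the returned address equals the original depends on what follows the mapping in the address space, so callers should always continue with the returned pointer.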