2005-04-17 02:20:36 +04:00
/*
* linux/mm/nommu.c
*
* Replacement code for mm functions to support CPU's that don't
* have any form of memory management unit (thus no virtual memory).
*
* See Documentation/nommu-mmap.txt
*
2009-01-08 15:04:47 +03:00
* Copyright (c) 2004-2008 David Howells <dhowells@redhat.com>
2005-04-17 02:20:36 +04:00
* Copyright (c) 2000-2003 David McCullough <davidm@snapgear.com>
* Copyright (c) 2000-2001 D Jeff Dionne <jeff@uClinux.org>
* Copyright (c) 2002 Greg Ungerer <gerg@snapgear.com>
2010-12-24 06:08:30 +03:00
* Copyright (c) 2007-2010 Paul Mundt <lethal@linux-sh.org>
2005-04-17 02:20:36 +04:00
*/
2014-06-07 01:38:30 +04:00
# define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
2011-10-16 10:01:52 +04:00
# include <linux/export.h>
2005-04-17 02:20:36 +04:00
# include <linux/mm.h>
mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches was needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.
We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.
The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:
1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+
2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+
3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+
4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+
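The per-thread cache described above (invalidation by bumping a sequence number, replacement slot chosen from the page number of the address) can be illustrated with a small userspace sketch. It is not the kernel's vmacache code: every `sketch_` name, the 4-slot size, and the 4 KiB page shift are assumptions made for illustration, and the rare sequence-number overflow case, where all caches sharing the address space must be flushed, is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT  12	/* assumption: 4 KiB pages */
#define SKETCH_CACHE_SLOTS 4	/* assumption: small power-of-two slot count */

struct sketch_vma {
	unsigned long start;	/* first address covered by the mapping */
	unsigned long end;	/* first address past the mapping */
};

struct sketch_mm {
	uint32_t seqnum;	/* bumped whenever the mappings change */
};

struct sketch_thread_cache {
	uint32_t seqnum;	/* mm seqnum the slots were filled under */
	const struct sketch_vma *slots[SKETCH_CACHE_SLOTS];
};

/* Drop every cached entry if the address space changed since the last fill. */
static void sketch_cache_sync(struct sketch_thread_cache *c,
			      const struct sketch_mm *mm)
{
	size_t i;

	if (c->seqnum == mm->seqnum)
		return;
	for (i = 0; i < SKETCH_CACHE_SLOTS; i++)
		c->slots[i] = NULL;
	c->seqnum = mm->seqnum;
}

/* Invalidate all per-thread caches of this mm by bumping one counter. */
static void sketch_cache_invalidate(struct sketch_mm *mm)
{
	mm->seqnum++;
}

/* Return the cached mapping containing addr, or NULL on a miss. */
static const struct sketch_vma *
sketch_cache_find(struct sketch_thread_cache *c, const struct sketch_mm *mm,
		  unsigned long addr)
{
	size_t i;

	sketch_cache_sync(c, mm);
	for (i = 0; i < SKETCH_CACHE_SLOTS; i++) {
		const struct sketch_vma *vma = c->slots[i];

		if (vma && vma->start <= addr && addr < vma->end)
			return vma;
	}
	return NULL;
}

/* After a miss, store the mapping in the slot picked by addr's page number. */
static void sketch_cache_update(struct sketch_thread_cache *c,
				const struct sketch_mm *mm,
				unsigned long addr,
				const struct sketch_vma *vma)
{
	assert(vma != NULL);
	sketch_cache_sync(c, mm);
	c->slots[(addr >> SKETCH_PAGE_SHIFT) & (SKETCH_CACHE_SLOTS - 1)] = vma;
}
```

A caller would try `sketch_cache_find()` first, fall back to the full lookup on a miss, and then call `sketch_cache_update()`; any change to the mappings only needs `sketch_cache_invalidate()`.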
[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 02:37:25 +04:00
# include <linux/vmacache.h>
2005-04-17 02:20:36 +04:00
# include <linux/mman.h>
# include <linux/swap.h>
# include <linux/file.h>
# include <linux/highmem.h>
# include <linux/pagemap.h>
# include <linux/slab.h>
# include <linux/vmalloc.h>
# include <linux/blkdev.h>
# include <linux/backing-dev.h>
2014-04-08 02:37:26 +04:00
# include <linux/compiler.h>
2005-04-17 02:20:36 +04:00
# include <linux/mount.h>
# include <linux/personality.h>
# include <linux/security.h>
# include <linux/syscalls.h>
2010-10-30 10:54:44 +04:00
# include <linux/audit.h>
2013-02-07 19:46:59 +04:00
# include <linux/sched/sysctl.h>
2014-06-07 01:38:30 +04:00
# include <linux/printk.h>
2005-04-17 02:20:36 +04:00
# include <asm/uaccess.h>
# include <asm/tlb.h>
# include <asm/tlbflush.h>
2009-09-22 04:03:57 +04:00
# include <asm/mmu_context.h>
2009-01-08 15:04:47 +03:00
# include "internal.h"
2005-04-17 02:20:36 +04:00
void * high_memory ;
2015-02-05 23:25:12 +03:00
EXPORT_SYMBOL ( high_memory ) ;
2005-04-17 02:20:36 +04:00
struct page * mem_map ;
unsigned long max_mapnr ;
2015-03-13 02:26:05 +03:00
EXPORT_SYMBOL ( max_mapnr ) ;
2009-09-23 20:05:53 +04:00
unsigned long highest_memmap_pfn ;
2009-05-01 02:08:51 +04:00
struct percpu_counter vm_committed_as ;
2005-04-17 02:20:36 +04:00
int sysctl_overcommit_memory = OVERCOMMIT_GUESS ; /* heuristic overcommit */
int sysctl_overcommit_ratio = 50 ; /* default is 50% */
2014-01-22 03:49:14 +04:00
unsigned long sysctl_overcommit_kbytes __read_mostly ;
2005-04-17 02:20:36 +04:00
int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT ;
2009-05-07 03:03:05 +04:00
int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS ;
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
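The two min() rules above can be written out as a short sketch. The function names and the 4 KiB page shift are assumptions for illustration, and dividing by 32 is used here as an approximation of 3% (it is 3.125%); this is not presented as the patch's exact code.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

static unsigned long sketch_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Pages held back for other user processes in overcommit 'never' mode:
 * min(~3% of the current process size, user reserve converted to pages).
 */
static unsigned long sketch_user_reserve_pages(unsigned long process_pages,
					       unsigned long user_reserve_kbytes)
{
	unsigned long reserve_pages;

	assert(SKETCH_PAGE_SHIFT >= 10);
	/* kbytes -> pages: shift by (page shift - 10) */
	reserve_pages = user_reserve_kbytes >> (SKETCH_PAGE_SHIFT - 10);
	return sketch_min(process_pages / 32, reserve_pages);
}

/* Default for the knob: min(~3% of free memory, 128MB), in kbytes. */
static unsigned long sketch_default_user_reserve_kbytes(unsigned long free_kbytes)
{
	return sketch_min(free_kbytes / 32, 1UL << 17);
}
```

With these assumptions, an 8GB process (2097152 pages) has its reserve capped at 32768 pages (128MB) rather than growing to about 3% of its size, while a small process still reserves only about 3% of itself.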
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because this mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutely safe. Over 230MB each
was safest in my testing.
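The two estimates in this answer (sum of RSS for 'guess', max VSZ plus sum of RSS for 'never') can be sketched as below. The struct and function names are hypothetical, and the sizes must be measured on the target system (for example with ps) for sshd or login, the shell, and top.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes of one recovery program, in kbytes (measured by the admin). */
struct sketch_proc {
	unsigned long vsz_kb;	/* virtual size */
	unsigned long rss_kb;	/* resident set size */
};

/* Overcommit 'guess': the sum of the resident set sizes is enough. */
static unsigned long sketch_reserve_guess_kb(const struct sketch_proc *p,
					     size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(p != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += p[i].rss_kb;
	return sum;
}

/* Overcommit 'never': the largest virtual size plus the sum of the RSS. */
static unsigned long sketch_reserve_never_kb(const struct sketch_proc *p,
					     size_t n)
{
	unsigned long max_vsz = 0;
	size_t i;

	assert(p != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (p[i].vsz_kb > max_vsz)
			max_vsz = p[i].vsz_kb;
	return max_vsz + sketch_reserve_guess_kb(p, n);
}
```

The result is a lower bound in kbytes; as noted above, systems without swap and with overcommit disabled may need considerably more.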
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcommit 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5432/5432     | no    | yes           | yes
guess      | yes  | 4    | 5444/5444     | 1     | yes           | yes
guess      | no   | 1    | 5302/5449     | no    | yes           | yes
guess      | no   | 4    | -             | crash | no            | no
never      | yes  | 1    | 5460/5460     | 1     | yes           | yes
never      | yes  | 4    | 5460/5460     | 1     | yes           | yes
never      | no   | 1    | 5218/5432     | no    | no            | yes
never      | no   | 4    | 5203/5448     | no    | no            | yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5419/5419     | no    | -     yes     | 8MB   yes
guess      | yes  | 4    | 5436/5436     | 1     | -     yes     | 8MB   yes
guess      | no   | 1    | 5440/5440     | *     | -     yes     | 8MB   yes
guess      | no   | 4    | -             | crash | -     no      | 8MB   no
* process would successfully mlock, then the oom killer would pick it
never      | yes  | 1    | 5446/5446     | no    | 10MB  yes     | 20MB  yes
never      | yes  | 4    | 5456/5456     | no    | 10MB  yes     | 20MB  yes
never      | no   | 1    | 5387/5429     | no    | 128MB no      | 8MB   barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB   barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB   barely
never      | no   | 1    | 5359/5448     | no    | 10MB  no      | 10MB  barely
never      | no   | 1    | 5323/5428     | no    | 0MB   no      | 10MB  barely
never      | no   | 1    | 5332/5428     | no    | 0MB   no      | 50MB  yes
never      | no   | 1    | 5293/5429     | no    | 0MB   no      | 90MB  yes
never      | no   | 1    | 5001/5427     | no    | 230MB yes     | 338MB yes
never      | no   | 4*   | 4998/5424     | no    | 230MB yes     | 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 02:08:10 +04:00
unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17 ; /* 128MB */
2013-04-30 02:08:11 +04:00
unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13 ; /* 8MB */
2005-04-17 02:20:36 +04:00
int heap_stack_gap = 0 ;
2009-04-03 03:56:32 +04:00
atomic_long_t mmap_pages_allocated ;
2009-01-08 15:04:47 +03:00
2012-11-16 02:34:42 +04:00
/*
* The global memory commitment made in the system can be a metric
* that can be used to drive ballooning decisions when Linux is hosted
* as a guest. On Hyper-V, the host implements a policy engine for dynamically
* balancing memory across competing virtual machines that are hosted.
* Several metrics drive this policy engine including the guest reported
* memory commitment.
*/
unsigned long vm_memory_committed ( void )
{
return percpu_counter_read_positive ( & vm_committed_as ) ;
}
EXPORT_SYMBOL_GPL ( vm_memory_committed ) ;
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( mem_map ) ;
2009-01-08 15:04:47 +03:00
/* list of mapped, potentially shareable regions */
static struct kmem_cache * vm_region_jar ;
struct rb_root nommu_region_tree = RB_ROOT ;
DECLARE_RWSEM ( nommu_region_sem ) ;
2005-04-17 02:20:36 +04:00
2009-09-27 22:29:37 +04:00
const struct vm_operations_struct generic_file_vm_ops = {
2005-04-17 02:20:36 +04:00
} ;
/*
* Return the total memory allocated for this pointer, not
* just what the caller asked for.
*
* Doesn't have to be accurate, i.e. may have races.
*/
unsigned int kobjsize ( const void * objp )
{
struct page * page ;
2008-04-28 13:13:38 +04:00
/*
* If the object we have should not have ksize performed on it,
* return size of 0
*/
2008-06-12 11:29:55 +04:00
if ( ! objp || ! virt_addr_valid ( objp ) )
2008-06-06 09:46:08 +04:00
return 0 ;
page = virt_to_head_page ( objp ) ;
/*
* If the allocator sets PageSlab, we know the pointer came from
* kmalloc().
*/
2005-04-17 02:20:36 +04:00
if ( PageSlab ( page ) )
return ksize ( objp ) ;
2009-01-08 15:04:48 +03:00
/*
* If it's not a compound page, see if we have a matching VMA
* region. This test is intentionally done in reverse order,
* so if there's no VMA, we still fall through and hand back
* PAGE_SIZE for 0-order pages.
*/
if ( ! PageCompound ( page ) ) {
struct vm_area_struct * vma ;
vma = find_vma ( current->mm , ( unsigned long ) objp ) ;
if ( vma )
return vma->vm_end - vma->vm_start ;
}
2008-06-06 09:46:08 +04:00
/*
* The ksize() function is only guaranteed to work for pointers
2008-06-12 11:29:55 +04:00
* returned by kmalloc(). So handle arbitrary pointers here.
2008-06-06 09:46:08 +04:00
*/
2008-06-12 11:29:55 +04:00
return PAGE_SIZE << compound_order ( page ) ;
2005-04-17 02:20:36 +04:00
}
2013-02-23 04:35:55 +04:00
long __get_user_pages ( struct task_struct * tsk , struct mm_struct * mm ,
unsigned long start , unsigned long nr_pages ,
unsigned int foll_flags , struct page * * pages ,
struct vm_area_struct * * vmas , int * nonblocking )
2005-04-17 02:20:36 +04:00
{
2006-09-27 12:50:17 +04:00
struct vm_area_struct * vma ;
2006-09-27 12:50:18 +04:00
unsigned long vm_flags ;
int i ;
/* calculate required read or write permissions.
2009-09-22 04:03:31 +04:00
* If FOLL_FORCE is set, we only require the "MAY" flags.
2006-09-27 12:50:18 +04:00
*/
2009-09-22 04:03:31 +04:00
vm_flags = ( foll_flags & FOLL_WRITE ) ?
( VM_WRITE | VM_MAYWRITE ) : ( VM_READ | VM_MAYREAD ) ;
vm_flags &= ( foll_flags & FOLL_FORCE ) ?
( VM_MAYREAD | VM_MAYWRITE ) : ( VM_READ | VM_WRITE ) ;
2005-04-17 02:20:36 +04:00
2009-06-25 13:58:55 +04:00
for ( i = 0 ; i < nr_pages ; i++ ) {
2010-03-25 19:48:38 +03:00
vma = find_vma ( mm , start ) ;
2006-09-27 12:50:18 +04:00
if ( ! vma )
goto finish_or_fault ;
/* protect what we can, including chardevs */
2009-09-22 04:03:24 +04:00
if ( ( vma->vm_flags & ( VM_IO | VM_PFNMAP ) ) ||
! ( vm_flags & vma->vm_flags ) )
2006-09-27 12:50:18 +04:00
goto finish_or_fault ;
2006-09-27 12:50:17 +04:00
2005-04-17 02:20:36 +04:00
if ( pages ) {
pages [ i ] = virt_to_page ( start ) ;
if ( pages [ i ] )
page_cache_get ( pages [ i ] ) ;
}
if ( vmas )
2006-09-27 12:50:17 +04:00
vmas [ i ] = vma ;
2010-03-25 19:48:44 +03:00
start = ( start + PAGE_SIZE ) & PAGE_MASK ;
2005-04-17 02:20:36 +04:00
}
2006-09-27 12:50:18 +04:00
return i ;
finish_or_fault :
return i ? : -EFAULT ;
2005-04-17 02:20:36 +04:00
}
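The permission-mask calculation in __get_user_pages() above can be exercised in isolation with the sketch below. The `SK_` bit values are illustrative placeholders rather than the kernel's real constants, and the helper names are hypothetical; only the two-step masking mirrors the function.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel defines its own constants. */
#define SK_VM_READ	0x01u
#define SK_VM_WRITE	0x02u
#define SK_VM_MAYREAD	0x10u
#define SK_VM_MAYWRITE	0x20u
#define SK_FOLL_WRITE	0x01u
#define SK_FOLL_FORCE	0x10u

/* Compute the single VMA flag that must be set for the access to proceed. */
static unsigned int sketch_required_flags(unsigned int foll_flags)
{
	unsigned int vm_flags;

	/* Step 1: pick the read or write pair of flags. */
	vm_flags = (foll_flags & SK_FOLL_WRITE) ?
			(SK_VM_WRITE | SK_VM_MAYWRITE) :
			(SK_VM_READ | SK_VM_MAYREAD);
	/* Step 2: with FORCE keep only the "MAY" flag, else the plain flag. */
	vm_flags &= (foll_flags & SK_FOLL_FORCE) ?
			(SK_VM_MAYREAD | SK_VM_MAYWRITE) :
			(SK_VM_READ | SK_VM_WRITE);
	return vm_flags;
}

/* True when a mapping with vma_flags satisfies the requested access. */
static bool sketch_access_ok(unsigned int foll_flags, unsigned int vma_flags)
{
	unsigned int required = sketch_required_flags(foll_flags);

	assert(required != 0);
	return (required & vma_flags) != 0;
}
```

For a read-only mapping that still carries the "may write" bit, a plain write request is refused while a forced write is allowed, which is the distinction the FOLL_FORCE comment describes.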
mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism lets pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
because the current get_user_pages() can't grab PROT_NONE pages, and
therefore PROT_NONE pages can't be munlocked.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 07:26:44 +04:00
/*
* get a list of pages in an address range belonging to the specified process
* and indicate the VMA that covers each page
* - this is potentially dodgy as we may end up incrementing the page count of a
* slab page or a secondary page from a compound page
* - don't permit access to VMAs that don't support it, such as I/O mappings
*/
2013-02-23 04:35:55 +04:00
long get_user_pages ( struct task_struct * tsk , struct mm_struct * mm ,
unsigned long start , unsigned long nr_pages ,
int write , int force , struct page * * pages ,
struct vm_area_struct * * vmas )
2008-10-19 07:26:44 +04:00
{
int flags = 0 ;
if ( write )
2009-09-22 04:03:31 +04:00
flags |= FOLL_WRITE ;
2008-10-19 07:26:44 +04:00
if ( force )
2009-09-22 04:03:31 +04:00
flags |= FOLL_FORCE ;
2008-10-19 07:26:44 +04:00
2011-01-14 02:46:14 +03:00
	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
				NULL);
2008-10-19 07:26:44 +04:00
}
2005-09-12 05:18:10 +04:00
EXPORT_SYMBOL ( get_user_pages ) ;
mm: gup: add get_user_pages_locked and get_user_pages_unlocked
FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading to reduce the mmap_sem contention (for writing), like while
waiting for I/O completion. The problem is that right now practically no
get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
that nifty feature.
Andres fixed it for the KVM page fault. However get_user_pages_fast
remains uncovered, and 99% of other get_user_pages aren't using it either
(the only exception being FOLL_NOWAIT in KVM which is really nonblocking
and in fact it doesn't even release the mmap_sem).
So this patchset extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes the most important places (including
gup_fast) use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.
The few places that remain uncovered are drivers like v4l and other
exceptions that tend to work on their own memory rather than on random
user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
fully covered by this patch).
A follow up patch should probably also add a printk_once warning to
get_user_pages that should go obsolete and be phased out eventually. The
"vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
mmap_sem is released).
While this is just an optimization, this becomes an absolute requirement
for the userfaultfd feature http://lwn.net/Articles/615086/ .
userfaultfd allows blocking the page fault, and in order to do so I
need to drop the mmap_sem first. So this patch also ensures that, for all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault, or a get_user_pages) always has
FAULT_FOLL_ALLOW_RETRY set. Then the userfaultfd blocks and it is woken
only when the pagetable is already mapped. The second fault attempt after
the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
without it.
This patch (of 5):
We can leverage the VM_FAULT_RETRY functionality in the page fault paths
better by using either get_user_pages_locked or get_user_pages_unlocked.
The former allows conversion of get_user_pages invocations that will have
to pass a "&locked" parameter to know if the mmap_sem was dropped during
the call. Example from:
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);
to:
    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);
The latter is suitable only as a drop in replacement of the form:
    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);
into:
    get_user_pages_unlocked(tsk, mm, ..., pages);
Where tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't null, for the former we
just make it explicit by dropping the parameter).
If vmas is not NULL these two methods cannot be used.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
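The "&locked" calling convention described in the message above can be modelled in plain userspace C. This is only a toy sketch under stated assumptions: toy_sem, toy_gup_locked() and toy_caller() are hypothetical stand-ins invented for illustration, not kernel APIs, and the "lookup" does no real work. It shows only that the caller must skip the final unlock when the callee reports that it already dropped the lock.

```c
#include <stdbool.h>

/* Hypothetical stand-in for mm->mmap_sem: just counts read holders. */
struct toy_sem {
	int readers;
};

static void toy_down_read(struct toy_sem *s) { s->readers++; }
static void toy_up_read(struct toy_sem *s)   { s->readers--; }

/*
 * Pretend page lookup. When must_sleep is set, the lock is dropped and
 * *locked is cleared so the caller knows not to unlock a second time.
 */
long toy_gup_locked(struct toy_sem *s, long nr_pages, bool must_sleep,
		    int *locked)
{
	if (must_sleep) {
		toy_up_read(s);
		*locked = 0;
	}
	return nr_pages;
}

/* Caller that follows the "to:" pattern from the commit message. */
long toy_caller(struct toy_sem *s, long nr_pages, bool must_sleep)
{
	int locked = 1;
	long ret;

	toy_down_read(s);
	ret = toy_gup_locked(s, nr_pages, must_sleep, &locked);
	if (locked)
		toy_up_read(s);
	return ret;
}
```

In both the sleeping and non-sleeping case the reader count returns to zero, which is the property the "if (locked)" check is there to preserve.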
2015-02-12 02:27:17 +03:00
long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   int write, int force, struct page **pages,
			   int *locked)
{
	return get_user_pages(tsk, mm, start, nr_pages, write, force,
			      pages, NULL);
}
EXPORT_SYMBOL(get_user_pages_locked);
2015-02-12 02:27:20 +03:00
long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
			       unsigned long start, unsigned long nr_pages,
			       int write, int force, struct page **pages,
			       unsigned int gup_flags)
2015-02-12 02:27:17 +03:00
{
	long ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
			     pages, NULL);
	up_read(&mm->mmap_sem);
	return ret;
}
2015-02-12 02:27:20 +03:00
EXPORT_SYMBOL(__get_user_pages_unlocked);
long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long start, unsigned long nr_pages,
			     int write, int force, struct page **pages)
{
	return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
					 force, pages, 0);
}
2015-02-12 02:27:17 +03:00
EXPORT_SYMBOL(get_user_pages_unlocked);
2009-06-25 23:31:57 +04:00
/**
 * follow_pfn - look up PFN at a user virtual address
 * @vma: memory mapping
 * @address: user virtual address
 * @pfn: location to store found PFN
 *
 * Only IO mappings and raw PFN mappings are allowed.
 *
 * Returns zero and the pfn at @pfn on success, -ve otherwise.
 */
int follow_pfn(struct vm_area_struct *vma, unsigned long address,
	       unsigned long *pfn)
{
	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
		return -EINVAL;

	*pfn = address >> PAGE_SHIFT;
	return 0;
}
EXPORT_SYMBOL(follow_pfn);
2013-04-30 02:07:37 +04:00
LIST_HEAD ( vmap_area_list ) ;
2005-04-17 02:20:36 +04:00
2008-02-05 09:28:32 +03:00
void vfree ( const void * addr )
2005-04-17 02:20:36 +04:00
{
kfree ( addr ) ;
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL ( vfree ) ;
2005-04-17 02:20:36 +04:00
2005-10-07 10:46:04 +04:00
void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)
2005-04-17 02:20:36 +04:00
{
	/*
2007-10-20 01:11:38 +04:00
	 * You can't specify __GFP_HIGHMEM with kmalloc() since kmalloc()
	 * returns only a logical address.
2005-04-17 02:20:36 +04:00
	 */
2006-03-22 11:08:34 +03:00
	return kmalloc(size, (gfp_mask | __GFP_COMP) & ~__GFP_HIGHMEM);
2005-04-17 02:20:36 +04:00
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL(__vmalloc);
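The flag adjustment performed by __vmalloc() before it hands off to kmalloc() can be tried out in userspace. The sketch below is an illustration only: the TOY_GFP_* bit values are made-up placeholders rather than the kernel's real gfp_t encoding, and nommu_kmalloc_flags() is a hypothetical helper name.

```c
/* Hypothetical bit assignments, chosen only for this example. */
#define TOY_GFP_HIGHMEM 0x02u
#define TOY_GFP_ZERO    0x08u
#define TOY_GFP_COMP    0x10u

/*
 * Same shape as the expression in __vmalloc(): always add the COMP bit,
 * always strip the HIGHMEM bit, and leave every other bit untouched.
 */
unsigned int nommu_kmalloc_flags(unsigned int gfp_mask)
{
	return (gfp_mask | TOY_GFP_COMP) & ~TOY_GFP_HIGHMEM;
}
```

Callers such as vmalloc() and vzalloc() pass __GFP_HIGHMEM unconditionally, so the mask is what keeps that bit from ever reaching kmalloc().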
2005-04-17 02:20:36 +04:00
2008-02-05 09:29:59 +03:00
void *vmalloc_user(unsigned long size)
{
	void *ret;

	ret = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
			PAGE_KERNEL);
	if (ret) {
		struct vm_area_struct *vma;

		down_write(&current->mm->mmap_sem);
		vma = find_vma(current->mm, (unsigned long)ret);
		if (vma)
			vma->vm_flags |= VM_USERMAP;
		up_write(&current->mm->mmap_sem);
	}

	return ret;
}
EXPORT_SYMBOL(vmalloc_user);
2008-02-05 09:28:32 +03:00
struct page * vmalloc_to_page ( const void * addr )
2005-04-17 02:20:36 +04:00
{
return virt_to_page ( addr ) ;
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL ( vmalloc_to_page ) ;
2005-04-17 02:20:36 +04:00
2008-02-05 09:28:32 +03:00
unsigned long vmalloc_to_pfn ( const void * addr )
2005-04-17 02:20:36 +04:00
{
return page_to_pfn ( virt_to_page ( addr ) ) ;
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL ( vmalloc_to_pfn ) ;
2005-04-17 02:20:36 +04:00
long vread(char *buf, char *addr, unsigned long count)
{
2013-07-04 02:02:36 +04:00
	/* Don't allow overflow */
	if ((unsigned long)buf + count < count)
		count = -(unsigned long)buf;
2005-04-17 02:20:36 +04:00
	memcpy(buf, addr, count);
	return count;
}

long vwrite(char *buf, char *addr, unsigned long count)
{
	/* Don't allow overflow */
	if ((unsigned long)addr + count < count)
		count = -(unsigned long)addr;
	memcpy(addr, buf, count);
2014-04-08 02:37:36 +04:00
	return count;
2005-04-17 02:20:36 +04:00
}
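The "Don't allow overflow" check in vread() and vwrite() relies on unsigned wrap-around. The userspace sketch below isolates that arithmetic; clamp_to_address_space() is a hypothetical helper name introduced only for illustration, and it takes the address as a plain unsigned long instead of a pointer.

```c
#include <limits.h>

/*
 * Clamp count so that base + count does not wrap past the top of the
 * address space, using the same comparison as vread()/vwrite().
 */
unsigned long clamp_to_address_space(unsigned long base, unsigned long count)
{
	if (base + count < count)	/* the unsigned sum wrapped around */
		count = -base;		/* bytes left before the wrap point */
	return count;
}
```

For example, with base 16 bytes below the top of the address space and a request of 32 bytes, the sum wraps and the count is reduced to 16. A request that does not wrap is returned unchanged.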
/*
2015-07-07 04:14:59 +03:00
* vmalloc - allocate virtually contiguous memory
2005-04-17 02:20:36 +04:00
*
* @ size : allocation size
*
* Allocate enough pages to cover @ size from the page level
2015-07-07 04:14:59 +03:00
* allocator and map them into contiguous kernel virtual space .
2005-04-17 02:20:36 +04:00
*
2006-10-04 01:21:02 +04:00
* For tight control over page level allocator and protection flags
2005-04-17 02:20:36 +04:00
* use __vmalloc ( ) instead .
*/
void * vmalloc ( unsigned long size )
{
return __vmalloc ( size , GFP_KERNEL | __GFP_HIGHMEM , PAGE_KERNEL ) ;
}
2006-03-01 03:59:18 +03:00
EXPORT_SYMBOL ( vmalloc ) ;
2010-10-27 01:22:06 +04:00
/*
2015-07-07 04:14:59 +03:00
* vzalloc - allocate virtually contiguous memory with zero fill
2010-10-27 01:22:06 +04:00
*
* @ size : allocation size
*
* Allocate enough pages to cover @ size from the page level
2015-07-07 04:14:59 +03:00
* allocator and map them into contiguous kernel virtual space .
2010-10-27 01:22:06 +04:00
* The memory allocated is set to zero .
*
* For tight control over page level allocator and protection flags
* use __vmalloc ( ) instead .
*/
void * vzalloc ( unsigned long size )
{
return __vmalloc ( size , GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO ,
PAGE_KERNEL ) ;
}
EXPORT_SYMBOL ( vzalloc ) ;
/**
* vmalloc_node - allocate memory on a specific node
* @ size : allocation size
* @ node : numa node
*
* Allocate enough pages to cover @ size from the page level
* allocator and map them into contiguous kernel virtual space .
*
* For tight control over page level allocator and protection flags
* use __vmalloc ( ) instead .
*/
2006-03-01 03:59:18 +03:00
void * vmalloc_node ( unsigned long size , int node )
{
return vmalloc ( size ) ;
}
2010-12-24 05:50:34 +03:00
EXPORT_SYMBOL ( vmalloc_node ) ;
2010-10-27 01:22:06 +04:00
/**
* vzalloc_node - allocate memory on a specific node with zero fill
* @ size : allocation size
* @ node : numa node
*
* Allocate enough pages to cover @ size from the page level
* allocator and map them into contiguous kernel virtual space .
* The memory allocated is set to zero .
*
* For tight control over page level allocator and protection flags
* use __vmalloc ( ) instead .
*/
void * vzalloc_node ( unsigned long size , int node )
{
return vzalloc ( size ) ;
}
EXPORT_SYMBOL ( vzalloc_node ) ;
2005-04-17 02:20:36 +04:00
2008-08-04 11:01:47 +04:00
# ifndef PAGE_KERNEL_EXEC
# define PAGE_KERNEL_EXEC PAGE_KERNEL
# endif
/**
 * vmalloc_exec - allocate virtually contiguous, executable memory
 * @size: allocation size
 *
 * Kernel-internal function to allocate enough pages to cover @size from
 * the page level allocator and map them into contiguous and
 * executable kernel virtual space.
 *
 * For tight control over page level allocator and protection flags
 * use __vmalloc() instead.
 */
void *vmalloc_exec(unsigned long size)
{
	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
}
2007-07-21 15:37:25 +04:00
/**
* vmalloc_32 - allocate virtually contiguous memory ( 32 bit addressable )
2005-04-17 02:20:36 +04:00
* @ size : allocation size
*
* Allocate enough 32 bit PA addressable pages to cover @ size from the
2015-07-07 04:14:59 +03:00
* page level allocator and map them into contiguous kernel virtual space .
2005-04-17 02:20:36 +04:00
*/
void * vmalloc_32 ( unsigned long size )
{
return __vmalloc ( size , GFP_KERNEL , PAGE_KERNEL ) ;
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL ( vmalloc_32 ) ;
/**
* vmalloc_32_user - allocate zeroed virtually contiguous 32 bit memory
* @ size : allocation size
*
* The resulting memory area is 32 bit addressable and zeroed so it can be
* mapped to userspace without leaking data .
2008-02-05 09:29:59 +03:00
*
* VM_USERMAP is set on the corresponding VMA so that subsequent calls to
* remap_vmalloc_range ( ) are permissible .
2007-07-21 15:37:25 +04:00
*/
void * vmalloc_32_user ( unsigned long size )
{
2008-02-05 09:29:59 +03:00
	/*
	 * We'll have to sort out the ZONE_DMA bits for 64-bit,
	 * but for now this can simply use vmalloc_user() directly.
	 */
	return vmalloc_user(size);
2007-07-21 15:37:25 +04:00
}
EXPORT_SYMBOL ( vmalloc_32_user ) ;
2005-04-17 02:20:36 +04:00
void * vmap ( struct page * * pages , unsigned int count , unsigned long flags , pgprot_t prot )
{
BUG ( ) ;
return NULL ;
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL ( vmap ) ;
2005-04-17 02:20:36 +04:00
2008-02-05 09:28:32 +03:00
void vunmap ( const void * addr )
2005-04-17 02:20:36 +04:00
{
BUG ( ) ;
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL ( vunmap ) ;
2005-04-17 02:20:36 +04:00
2009-01-21 11:45:47 +03:00
void * vm_map_ram ( struct page * * pages , unsigned int count , int node , pgprot_t prot )
{
BUG ( ) ;
return NULL ;
}
EXPORT_SYMBOL ( vm_map_ram ) ;
void vm_unmap_ram ( const void * mem , unsigned int count )
{
BUG ( ) ;
}
EXPORT_SYMBOL ( vm_unmap_ram ) ;
void vm_unmap_aliases ( void )
{
}
EXPORT_SYMBOL_GPL ( vm_unmap_aliases ) ;
2007-05-08 11:27:03 +04:00
/*
* Implement a stub for vmalloc_sync_all ( ) if the architecture chose not to
* have one .
*/
2014-04-08 02:37:26 +04:00
void __weak vmalloc_sync_all ( void )
2007-05-08 11:27:03 +04:00
{
}
2010-12-24 06:08:30 +03:00
/**
* alloc_vm_area - allocate a range of kernel address space
* @ size : size of the area
*
* Returns : NULL on failure , vm_struct on success
*
* This function reserves a range of kernel address space , and
* allocates pagetables to map that range . No actual mappings
* are created . If the kernel address space is not shared
* between processes , it syncs the pagetable across all
* processes .
*/
2011-09-29 19:53:32 +04:00
struct vm_struct * alloc_vm_area ( size_t size , pte_t * * ptes )
2010-12-24 06:08:30 +03:00
{
BUG ( ) ;
return NULL ;
}
EXPORT_SYMBOL_GPL ( alloc_vm_area ) ;
void free_vm_area ( struct vm_struct * area )
{
BUG ( ) ;
}
EXPORT_SYMBOL_GPL ( free_vm_area ) ;
2007-07-21 15:37:25 +04:00
int vm_insert_page ( struct vm_area_struct * vma , unsigned long addr ,
struct page * page )
{
return - EINVAL ;
}
EXPORT_SYMBOL ( vm_insert_page ) ;
2005-04-17 02:20:36 +04:00
/*
 * sys_brk() for the most part doesn't need the global kernel
 * lock, except when an application is doing something nasty
 * like trying to un-brk an area that has already been mapped
 * to a regular file. In this case, the unmapping will need
 * to invoke file system routines that need the global lock.
 */
2009-01-14 16:14:15 +03:00
SYSCALL_DEFINE1(brk, unsigned long, brk)
2005-04-17 02:20:36 +04:00
{
	struct mm_struct *mm = current->mm;

	if (brk < mm->start_brk || brk > mm->context.end_brk)
		return mm->brk;

	if (mm->brk == brk)
		return mm->brk;

	/*
	 * Always allow shrinking brk
	 */
	if (brk <= mm->brk) {
		mm->brk = brk;
		return brk;
	}

	/*
	 * Ok, looks good - let it rip.
	 */
NOMMU: Avoiding duplicate icache flushes of shared maps
When working with FDPIC, there are many shared mappings of read-only
code regions between applications (the C library, applet packages like
busybox, etc.), but the current do_mmap_pgoff() function will issue an
icache flush whenever a VMA is added to an MM instead of only doing it
when the map is initially created.
The flush can instead be done when a region is first mmapped PROT_EXEC.
Note that we may not rely on the first mapping of a region being
executable - it's possible for it to be PROT_READ only, so we have to
remember whether we've flushed the region or not, and then flush the
entire region when a bit of it is made executable.
However, this also affects the brk area. That will no longer be
executable. We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
for NOMMU mode kernels, when it increases the brk allocation, making
sys_brk() flush the extra from the icache should suffice. The brk area
probably isn't used by NOMMU programs since the brk area can only use up
the leavings from the stack allocation, where the stack allocation is
larger than requested.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-06 20:23:23 +03:00
	flush_icache_range(mm->brk, brk);
2005-04-17 02:20:36 +04:00
	return mm->brk = brk;
}
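The decision logic of the brk system call above can be exercised in userspace with a small model. This is a sketch under assumptions: struct toy_mm and toy_brk() are hypothetical names, the end_brk field stands in for mm->context.end_brk, and the icache flush on growth is left out because it has no userspace equivalent here.

```c
/* Hypothetical, cut-down stand-in for the fields sys_brk() consults. */
struct toy_mm {
	unsigned long start_brk;	/* lowest permitted break */
	unsigned long brk;		/* current break */
	unsigned long end_brk;		/* highest permitted break */
};

unsigned long toy_brk(struct toy_mm *mm, unsigned long brk)
{
	if (brk < mm->start_brk || brk > mm->end_brk)
		return mm->brk;		/* out of range: refuse */
	if (mm->brk == brk)
		return mm->brk;		/* nothing to do */
	if (brk <= mm->brk) {
		mm->brk = brk;		/* shrinking is always allowed */
		return brk;
	}
	/* Growing: the kernel code flushes the icache for the new range here. */
	mm->brk = brk;
	return brk;
}
```

A refused request returns the unchanged current break, so a caller detects failure by comparing the return value with the address it asked for.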
2009-01-08 15:04:47 +03:00
/*
* initialise the VMA and region record slabs
*/
void __init mmap_init ( void )
2005-04-17 02:20:36 +04:00
{
2009-05-01 02:08:51 +04:00
int ret ;
2014-09-08 04:51:29 +04:00
ret = percpu_counter_init ( & vm_committed_as , 0 , GFP_KERNEL ) ;
2009-05-01 02:08:51 +04:00
VM_BUG_ON ( ret ) ;
2016-01-15 02:18:21 +03:00
vm_region_jar = KMEM_CACHE ( vm_region , SLAB_PANIC | SLAB_ACCOUNT ) ;
2005-04-17 02:20:36 +04:00
}
2006-09-27 12:50:20 +04:00
/*
2009-01-08 15:04:47 +03:00
* validate the region tree
* - the caller must hold the region lock
2006-09-27 12:50:20 +04:00
*/
2009-01-08 15:04:47 +03:00
#ifdef CONFIG_DEBUG_NOMMU_REGIONS
static noinline void validate_nommu_regions(void)
2006-09-27 12:50:20 +04:00
{
2009-01-08 15:04:47 +03:00
	struct vm_region *region, *last;
	struct rb_node *p, *lastp;
2006-09-27 12:50:20 +04:00
2009-01-08 15:04:47 +03:00
	lastp = rb_first(&nommu_region_tree);
	if (!lastp)
		return;
	last = rb_entry(lastp, struct vm_region, vm_rb);
2015-11-06 05:48:38 +03:00
	BUG_ON(last->vm_end <= last->vm_start);
	BUG_ON(last->vm_top < last->vm_end);
2009-01-08 15:04:47 +03:00
	while ((p = rb_next(lastp))) {
		region = rb_entry(p, struct vm_region, vm_rb);
		last = rb_entry(lastp, struct vm_region, vm_rb);
2015-11-06 05:48:38 +03:00
		BUG_ON(region->vm_end <= region->vm_start);
		BUG_ON(region->vm_top < region->vm_end);
		BUG_ON(region->vm_start < last->vm_top);
2006-09-27 12:50:20 +04:00
2009-01-08 15:04:47 +03:00
		lastp = p;
	}
2006-09-27 12:50:20 +04:00
}
2009-01-08 15:04:47 +03:00
#else
2009-04-03 03:56:32 +04:00
static void validate_nommu_regions(void)
{
}
2009-01-08 15:04:47 +03:00
#endif
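The three invariants asserted by validate_nommu_regions() can be checked on a plain array that is already sorted by start address. The sketch below is a userspace approximation: struct toy_region and toy_regions_valid() are hypothetical names, the array replaces the in-order rbtree walk, and a false return replaces BUG_ON().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in carrying only the fields the checks read. */
struct toy_region {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_top;
};

/* r[] must be in ascending vm_start order, as an in-order tree walk yields. */
bool toy_regions_valid(const struct toy_region *r, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].vm_end <= r[i].vm_start)
			return false;	/* region is empty or inverted */
		if (r[i].vm_top < r[i].vm_end)
			return false;	/* top lies below the end */
		if (i > 0 && r[i].vm_start < r[i - 1].vm_top)
			return false;	/* overlaps its predecessor */
	}
	return true;
}
```

Note that neighbours are compared against vm_top rather than vm_end, so a region may not start inside the slack between the previous region's end and its top.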
2006-09-27 12:50:20 +04:00
/*
2009-01-08 15:04:47 +03:00
* add a region into the global tree
2006-09-27 12:50:20 +04:00
*/
2009-01-08 15:04:47 +03:00
static void add_nommu_region(struct vm_region *region)
2006-09-27 12:50:20 +04:00
{
2009-01-08 15:04:47 +03:00
	struct vm_region *pregion;
	struct rb_node **p, *parent;
2006-09-27 12:50:20 +04:00
2009-01-08 15:04:47 +03:00
	validate_nommu_regions();

	parent = NULL;
	p = &nommu_region_tree.rb_node;
	while (*p) {
		parent = *p;
		pregion = rb_entry(parent, struct vm_region, vm_rb);
		if (region->vm_start < pregion->vm_start)
			p = &(*p)->rb_left;
		else if (region->vm_start > pregion->vm_start)
			p = &(*p)->rb_right;
		else if (pregion == region)
			return;
		else
			BUG();
2006-09-27 12:50:20 +04:00
	}
2009-01-08 15:04:47 +03:00
	rb_link_node(&region->vm_rb, parent, p);
	rb_insert_color(&region->vm_rb, &nommu_region_tree);
2006-09-27 12:50:20 +04:00
2009-01-08 15:04:47 +03:00
	validate_nommu_regions();
2006-09-27 12:50:20 +04:00
}
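The descent used by add_nommu_region() is the usual pointer-to-link walk for an ordered binary tree. The userspace sketch below reproduces only that walk, under assumptions: struct toy_node and toy_add_region() are hypothetical names, no rebalancing is done (the kernel's rb_insert_color() step is omitted), and the BUG() case is reported as -1 instead.

```c
#include <stddef.h>

/* Hypothetical node keyed by start address, as the region tree is. */
struct toy_node {
	unsigned long vm_start;
	struct toy_node *left, *right;
};

/*
 * Returns 0 if the node was linked in, 1 if this exact node was already
 * in the tree, and -1 if a different node has the same start address.
 */
int toy_add_region(struct toy_node **root, struct toy_node *region)
{
	struct toy_node **p = root;

	while (*p) {
		struct toy_node *cur = *p;

		if (region->vm_start < cur->vm_start)
			p = &cur->left;
		else if (region->vm_start > cur->vm_start)
			p = &cur->right;
		else if (cur == region)
			return 1;
		else
			return -1;
	}

	/* p now addresses the empty link where the new node belongs. */
	region->left = region->right = NULL;
	*p = region;
	return 0;
}
```

Keeping a pointer to the link rather than to the node means the empty-tree case and the leaf case are handled by the same final store.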
[PATCH] NOMMU: Make futexes work under NOMMU conditions
Make futexes work under NOMMU conditions.
This can be tested by running this in one shell:
	#define SYSERROR(X, Y) \
		do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)
	int main()
	{
		int shmid, tmp, *f, n;
		shmid = shmget(23, 4, IPC_CREAT|0666);
		SYSERROR(shmid, "shmget");
		f = shmat(shmid, NULL, 0);
		SYSERROR(f, "shmat");
		n = *f;
		printf("WAIT: %p{%x}\n", f, n);
		tmp = futex(f, FUTEX_WAIT, n, NULL, NULL, 0);
		SYSERROR(tmp, "futex");
		printf("WAITED: %d\n", tmp);
		tmp = shmdt(f);
		SYSERROR(tmp, "shmdt");
		exit(0);
	}
And then this in the other shell:
	#define SYSERROR(X, Y) \
		do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)
	int main()
	{
		int shmid, tmp, *f;
		shmid = shmget(23, 4, IPC_CREAT|0666);
		SYSERROR(shmid, "shmget");
		f = shmat(shmid, NULL, 0);
		SYSERROR(f, "shmat");
		(*f)++;
		printf("WAKE: %p{%x}\n", f, *f);
		tmp = futex(f, FUTEX_WAKE, 1, NULL, NULL, 0);
		SYSERROR(tmp, "futex");
		printf("WOKE: %d\n", tmp);
		tmp = shmdt(f);
		SYSERROR(tmp, "shmdt");
		exit(0);
	}
The first program will set up a SYSV IPC SHM segment and wait on a futex in it
for the number at the start to change. The second program will increment that
number and wake the first program up. This leads to output of the form:
	SHELL 1                 SHELL 2
	======================= =======================
	# /dowait
	WAIT: 0xc32ac000{0}
	                        # /dowake
	                        WAKE: 0xc32ac000{1}
	WAITED: 0               WOKE: 1
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
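The two test programs above call a futex() function that they do not define, and glibc does not export one. A minimal sketch of such a wrapper, assuming a Linux userspace with the raw system call available through syscall(2), might look like this; `futex_wait_on()` and `futex_wake_on()` are hypothetical helper names added for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Thin wrapper: glibc exports no futex() function, so go through syscall(). */
static long futex(int *uaddr, int op, int val, const struct timespec *timeout,
		  int *uaddr2, int val3)
{
	return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}

/* Sleep while *uaddr still holds expected. Returns 0 once woken, or -1
 * with errno set (EAGAIN if the value had already changed). */
static long futex_wait_on(int *uaddr, int expected)
{
	assert(uaddr);
	return futex(uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake at most nr waiters; returns how many were woken. */
static long futex_wake_on(int *uaddr, int nr)
{
	assert(uaddr);
	return futex(uaddr, FUTEX_WAKE, nr, NULL, NULL, 0);
}
```

With this wrapper in place, the waiter corresponds to `futex_wait_on(f, n)` and the waker to `futex_wake_on(f, 1)`. The test programs would also need the usual headers for printf(), exit(), shmget() and shmat().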
2006-09-27 12:50:22 +04:00
/*
2009-01-08 15:04:47 +03:00
* delete a region from the global tree
2006-09-27 12:50:22 +04:00
*/
2009-01-08 15:04:47 +03:00
static void delete_nommu_region(struct vm_region *region)
2006-09-27 12:50:22 +04:00
{
2009-01-08 15:04:47 +03:00
	BUG_ON(!nommu_region_tree.rb_node);
2006-09-27 12:50:22 +04:00
2009-01-08 15:04:47 +03:00
	validate_nommu_regions();
	rb_erase(&region->vm_rb, &nommu_region_tree);
	validate_nommu_regions();
2007-07-16 10:38:28 +04:00
}
2006-09-27 12:50:21 +04:00
/*
2009-01-08 15:04:47 +03:00
 * free a contiguous series of pages
2006-09-27 12:50:21 +04:00
 */
2009-01-08 15:04:47 +03:00
static void free_page_series(unsigned long from, unsigned long to)
2006-09-27 12:50:21 +04:00
{
2009-01-08 15:04:47 +03:00
	for (; from < to; from += PAGE_SIZE) {
		struct page *page = virt_to_page(from);
2009-04-03 03:56:32 +04:00
		atomic_long_dec(&mmap_pages_allocated);
2009-01-08 15:04:47 +03:00
		put_page(page);
2006-09-27 12:50:21 +04:00
	}
}
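The loop in free_page_series() visits one page per PAGE_SIZE step across the half-open range [from, to) and adjusts the global page counter once per page. The sketch below reproduces only that arithmetic in userspace; `SKETCH_PAGE_SIZE`, `release_page_series()` and the plain `long` counter are assumptions made for illustration and stand in for PAGE_SIZE, put_page() and the atomic mmap_pages_allocated.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, not taken from the source */

/* Walk [from, to) one page at a time, as free_page_series() does, and
 * decrement *allocated once per page. Returns the number of pages visited. */
static unsigned long release_page_series(unsigned long from, unsigned long to,
					 long *allocated)
{
	unsigned long released = 0;

	assert(allocated);
	for (; from < to; from += SKETCH_PAGE_SIZE) {
		(*allocated)--;	/* stands in for atomic_long_dec() */
		released++;	/* stands in for put_page() */
	}
	return released;
}
```

Because the loop condition is `from < to`, an end address that is not page aligned still causes the final, partially covered page to be visited.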
2006-09-27 12:50:20 +04:00
/*
2009-01-08 15:04:47 +03:00
 * release a reference to a region
2009-04-03 03:56:32 +04:00
 * - the caller must hold the region semaphore for writing, which this releases
2009-01-08 15:04:47 +03:00
 * - the region may not have been added to the tree yet, in which case vm_top
2009-01-08 15:04:47 +03:00
 *   will equal vm_start
2006-09-27 12:50:20 +04:00
 */
2009-01-08 15:04:47 +03:00
static void __put_nommu_region(struct vm_region *region)
	__releases(nommu_region_sem)
2005-04-17 02:20:36 +04:00
{
2009-01-08 15:04:47 +03:00
	BUG_ON(!nommu_region_tree.rb_node);
2005-04-17 02:20:36 +04:00
2010-01-16 04:01:33 +03:00
	if (--region->vm_usage == 0) {
2009-01-08 15:04:47 +03:00
		if (region->vm_top > region->vm_start)
2009-01-08 15:04:47 +03:00
			delete_nommu_region(region);
		up_write(&nommu_region_sem);
		if (region->vm_file)
			fput(region->vm_file);
		/* IO memory and memory shared directly out of the pagecache
		 * from ramfs/tmpfs mustn't be released here */
2015-06-25 02:57:47 +03:00
		if (region->vm_flags & VM_MAPPED_COPY)
2009-01-08 15:04:47 +03:00
			free_page_series(region->vm_start, region->vm_top);
2009-01-08 15:04:47 +03:00
		kmem_cache_free(vm_region_jar, region);
	} else {
		up_write(&nommu_region_sem);
2005-04-17 02:20:36 +04:00
	}
2009-01-08 15:04:47 +03:00
}
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
/*
 * release a reference to a region
 */
static void put_nommu_region(struct vm_region *region)
{
	down_write(&nommu_region_sem);
	__put_nommu_region(region);
2005-04-17 02:20:36 +04:00
}
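The release path above follows a common reference-counting pattern: decrement the count, and only when it reaches zero unlink the object from the shared structure and free it. The following is a single-threaded userspace sketch of that order of operations. `struct counted_region`, `regions_in_tree` and `region_put()` are hypothetical, and the locking is omitted (the kernel code holds nommu_region_sem for the decrement and unlink, then drops it before releasing the file and pages).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, single-threaded stand-in for struct vm_region. */
struct counted_region {
	int usage;	/* plays the role of vm_usage */
	bool in_tree;	/* true while vm_top > vm_start in the kernel code */
};

static int regions_in_tree;	/* stand-in for the global region tree */

/* Drop one reference. Returns true if this was the last one and the
 * region was unlinked and freed. */
static bool region_put(struct counted_region *r)
{
	assert(r && r->usage > 0);
	if (--r->usage > 0)
		return false;	/* other users remain */
	if (r->in_tree)
		regions_in_tree--;	/* unlink before releasing the memory */
	free(r);
	return true;
}
```

After `region_put()` returns true the pointer must not be used again, just as a region must not be touched after kmem_cache_free().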
2009-09-22 04:03:57 +04:00
/*
 * update protection on a vma
 */
static void protect_vma(struct vm_area_struct *vma, unsigned long flags)
{
#ifdef CONFIG_MPU
	struct mm_struct *mm = vma->vm_mm;
	long start = vma->vm_start & PAGE_MASK;
	while (start < vma->vm_end) {
		protect_page(mm, start, flags);
		start += PAGE_SIZE;
	}
	update_protections(mm);
#endif
}
2006-09-27 12:50:20 +04:00
/*
2009-01-08 15:04:47 +03:00
 * add a VMA into a process's mm_struct in the appropriate place in the list
 * and tree and add to the address space's page tree also if not an anonymous
 * page
 * - should be called with mm->mmap_sem held writelocked
2006-09-27 12:50:20 +04:00
 */
2009-01-08 15:04:47 +03:00
static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
2005-04-17 02:20:36 +04:00
{
mm: nommu: sort mm->mmap list properly
When I was reading nommu code, I found that it handles the vma list/tree
in an unusual way. IIUC, because there can be more than one
identical/overlapping vma in the list/tree, it sorts the tree more
strictly and does a linear search on the tree. But this isn't applied to
the list (i.e. the list could be constructed in a different order than
the tree, so we can't use the list when finding the first vma in that
order).
Since inserting/sorting a vma in the tree and linking it into the list are
done at the same time, we can easily construct both of them in the same
order. And since linear searching on the tree could be more costly than
doing it on the list, it can be converted to use the list.
Also, after commit 297c5eee3724 ("mm: make the vma list be doubly
linked") made the list doubly linked, a couple of places in the code needed
to be fixed to construct the list properly.
Patch 1/6 is a preparation. It keeps the list sorted the same as the tree
and constructs the doubly-linked list properly. Patch 2/6 is a simple
optimization for the vma deletion. Patches 3/6 and 4/6 convert tree
traversal to list traversal and the rest are simple fixes and cleanups.
This patch:
@vma added into @mm should be sorted by start addr, end addr and VMA
struct addr in that order because we may get identical VMAs in the @mm.
However this was true only for the rbtree, not for the list.
This patch fixes this by remembering 'rb_prev' during the tree traversal
like find_vma_prepare() does and linking the @vma via __vma_link_list().
After this patch, we can iterate over all the VMAs in the correct order
simply by using the @mm->mmap list.
[akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
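The ordering described in this commit message (start address, then end address, then the address of the VMA struct) can be written as a three-way comparison, and the remembered 'rb_prev' is simply the last element that compares lower than the one being inserted. The sketch below is a userspace illustration only: `struct span`, `span_cmp()` and `span_prev()` are hypothetical names, and a sorted array of pointers stands in for the tree walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of vm_area_struct used in sorting. */
struct span {
	unsigned long start, end;
};

/* Three-way comparison: start address, then end address, then the address
 * of the struct itself so that identical spans still have a fixed order. */
static int span_cmp(const struct span *a, const struct span *b)
{
	assert(a && b);
	if (a->start != b->start)
		return a->start < b->start ? -1 : 1;
	if (a->end != b->end)
		return a->end < b->end ? -1 : 1;
	if (a != b)
		return (uintptr_t)a < (uintptr_t)b ? -1 : 1;
	return 0;
}

/* Return the last element of sorted[] that orders before v, or NULL if v
 * would go first; this is the role 'rb_prev' plays in the patch. */
static const struct span *span_prev(const struct span *const sorted[],
				    size_t n, const struct span *v)
{
	const struct span *prev = NULL;

	for (size_t i = 0; i < n && span_cmp(sorted[i], v) < 0; i++)
		prev = sorted[i];
	return prev;
}
```

Linking the new element immediately after the value returned by `span_prev()` keeps a list in the same order as the comparison, which is the property the patch restores for the @mm->mmap list.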
2011-05-25 04:11:22 +04:00
	struct vm_area_struct *pvma, *prev;
2005-04-17 02:20:36 +04:00
	struct address_space *mapping;
2011-05-25 04:11:22 +04:00
	struct rb_node **p, *parent, *rb_prev;
2009-01-08 15:04:47 +03:00
	BUG_ON(!vma->vm_region);
	mm->map_count++;
	vma->vm_mm = mm;
2005-04-17 02:20:36 +04:00
2009-09-22 04:03:57 +04:00
	protect_vma(vma, vma->vm_flags);
2005-04-17 02:20:36 +04:00
	/* add the VMA to the mapping */
	if (vma->vm_file) {
		mapping = vma->vm_file->f_mapping;
2014-12-13 03:54:21 +03:00
		i_mmap_lock_write(mapping);
2005-04-17 02:20:36 +04:00
		flush_dcache_mmap_lock(mapping);
2012-10-09 03:31:25 +04:00
		vma_interval_tree_insert(vma, &mapping->i_mmap);
2005-04-17 02:20:36 +04:00
		flush_dcache_mmap_unlock(mapping);
2014-12-13 03:54:21 +03:00
		i_mmap_unlock_write(mapping);
2005-04-17 02:20:36 +04:00
	}
2009-01-08 15:04:47 +03:00
	/* add the VMA to the tree */
2011-05-25 04:11:22 +04:00
	parent = rb_prev = NULL;
2009-01-08 15:04:47 +03:00
	p = &mm->mm_rb.rb_node;
2005-04-17 02:20:36 +04:00
	while (*p) {
		parent = *p;
		pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
2009-01-08 15:04:47 +03:00
		/* sort by: start addr, end addr, VMA struct addr in that order
		 * (the latter is necessary as we may get identical VMAs) */
		if (vma->vm_start < pvma->vm_start)
2005-04-17 02:20:36 +04:00
			p = &(*p)->rb_left;
2011-05-25 04:11:22 +04:00
		else if (vma->vm_start > pvma->vm_start) {
			rb_prev = parent;
2005-04-17 02:20:36 +04:00
			p = &(*p)->rb_right;
2011-05-25 04:11:22 +04:00
		} else if (vma->vm_end < pvma->vm_end)
2009-01-08 15:04:47 +03:00
			p = &(*p)->rb_left;
2011-05-25 04:11:22 +04:00
		else if (vma->vm_end > pvma->vm_end) {
			rb_prev = parent;
2009-01-08 15:04:47 +03:00
			p = &(*p)->rb_right;
2011-05-25 04:11:22 +04:00
		} else if (vma < pvma)
2009-01-08 15:04:47 +03:00
			p = &(*p)->rb_left;
2011-05-25 04:11:22 +04:00
		else if (vma > pvma) {
			rb_prev = parent;
2009-01-08 15:04:47 +03:00
			p = &(*p)->rb_right;
2011-05-25 04:11:22 +04:00
		} else
2009-01-08 15:04:47 +03:00
			BUG();
2005-04-17 02:20:36 +04:00
	}
	rb_link_node(&vma->vm_rb, parent, p);
2009-01-08 15:04:47 +03:00
	rb_insert_color(&vma->vm_rb, &mm->mm_rb);
	/* add VMA to the VMA list also */
2011-05-25 04:11:22 +04:00
prev = NULL;
if (rb_prev)
prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
2009-01-08 15:04:47 +03:00
2011-05-25 04:11:22 +04:00
__vma_link_list(mm, vma, prev, parent);
2005-04-17 02:20:36 +04:00
}
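The ordering rule described in the commit message above (start address, then end address, then the address of the VMA structure) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct toy_vma`, `toy_vma_before()` and `toy_vma_insert()` are hypothetical names, and the sketch finds the insertion point with a linear scan of the list, whereas the kernel code remembers `rb_prev` during the rbtree walk and hands it to `__vma_link_list()`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vm_area_struct; only the fields used here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	struct toy_vma *vm_prev;
	struct toy_vma *vm_next;
};

/*
 * Return non-zero if a sorts before b: by start address, then end address,
 * then the address of the structure itself, so that even identical
 * mappings have a total order.
 */
static int toy_vma_before(const struct toy_vma *a, const struct toy_vma *b)
{
	if (a->vm_start != b->vm_start)
		return a->vm_start < b->vm_start;
	if (a->vm_end != b->vm_end)
		return a->vm_end < b->vm_end;
	return (uintptr_t)a < (uintptr_t)b;
}

/* Insert vma into the doubly linked list headed by *head, keeping it sorted. */
static void toy_vma_insert(struct toy_vma **head, struct toy_vma *vma)
{
	struct toy_vma *prev = NULL;
	struct toy_vma *next = *head;

	/* Linear scan for the first entry that does not sort before vma. */
	while (next && toy_vma_before(next, vma)) {
		prev = next;
		next = next->vm_next;
	}

	vma->vm_prev = prev;
	vma->vm_next = next;
	if (prev)
		prev->vm_next = vma;
	else
		*head = vma;
	if (next)
		next->vm_prev = vma;
}
```

Because the list and the tree use the same comparison, walking the list from its head visits the VMAs in the same order as an in-order walk of the tree.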
2006-09-27 12:50:20 +04:00
/*
2009-01-08 15:04:47 +03:00
* delete a VMA from its owning mm_struct and address space
2006-09-27 12:50:20 +04:00
*/
2009-01-08 15:04:47 +03:00
static void delete_vma_from_mm(struct vm_area_struct *vma)
2005-04-17 02:20:36 +04:00
{
mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches was needed. There are
two things to consider when dealing with this: the cache hit rate and
the latency of find_vma(). Improving the hit rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching scheme can be too high to consider.
We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.
The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:
1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+
2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+
3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+
4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while the baseline hit rate
is just about non-existent. The number of cycles can fluctuate
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+
[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 02:37:25 +04:00
int i ;
2005-04-17 02:20:36 +04:00
struct address_space *mapping;
2009-01-08 15:04:47 +03:00
struct mm_struct *mm = vma->vm_mm;
2014-04-08 02:37:25 +04:00
struct task_struct *curr = current;
2009-01-08 15:04:47 +03:00
2009-09-22 04:03:57 +04:00
protect_vma(vma, 0);
2009-01-08 15:04:47 +03:00
mm->map_count--;
2014-04-08 02:37:25 +04:00
for (i = 0; i < VMACACHE_SIZE; i++) {
/* if the vma is cached, invalidate the entire cache */
if (curr->vmacache[i] == vma) {
2014-06-24 00:22:02 +04:00
vmacache_invalidate(mm);
2014-04-08 02:37:25 +04:00
break;
}
}
2005-04-17 02:20:36 +04:00
/* remove the VMA from the mapping */
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
2014-12-13 03:54:21 +03:00
i_mmap_lock_write(mapping);
2005-04-17 02:20:36 +04:00
flush_dcache_mmap_lock(mapping);
2012-10-09 03:31:25 +04:00
vma_interval_tree_remove(vma, &mapping->i_mmap);
2005-04-17 02:20:36 +04:00
flush_dcache_mmap_unlock(mapping);
2014-12-13 03:54:21 +03:00
i_mmap_unlock_write(mapping);
2005-04-17 02:20:36 +04:00
}
2009-01-08 15:04:47 +03:00
/* remove from the MM's tree and list */
rb_erase(&vma->vm_rb, &mm->mm_rb);
2011-05-25 04:11:23 +04:00
if (vma->vm_prev)
vma->vm_prev->vm_next = vma->vm_next;
else
mm->mmap = vma->vm_next;
if (vma->vm_next)
vma->vm_next->vm_prev = vma->vm_prev;
2009-01-08 15:04:47 +03:00
}
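The invalidation scheme that delete_vma_from_mm() relies on, as described in the per-thread vma caching commit message, can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel's vmacache code: all `toy_` names are hypothetical, the slot count and 4 KiB page size are assumed, only one task is modelled, and the flush of every thread's cache on sequence-number overflow is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SHIFT 12                      /* assumption: 4 KiB pages */
#define TOY_CACHE_SIZE 4                       /* assumption: number of slots */
#define TOY_CACHE_MASK (TOY_CACHE_SIZE - 1)

/* Hypothetical stand-ins for the kernel structures. */
struct toy_vma { unsigned long vm_start, vm_end; };
struct toy_mm { uint32_t seqnum; };
struct toy_task {
	uint32_t seqnum;                       /* mm seqnum seen at last sync */
	struct toy_vma *slots[TOY_CACHE_SIZE];
};

/* Invalidate every cache attached to mm by bumping its sequence number. */
static void toy_cache_invalidate(struct toy_mm *mm)
{
	mm->seqnum++;
}

/* A stale cache is flushed lazily, the next time its owner looks at it. */
static int toy_cache_valid(const struct toy_mm *mm, struct toy_task *t)
{
	if (t->seqnum != mm->seqnum) {
		t->seqnum = mm->seqnum;
		memset(t->slots, 0, sizeof(t->slots));
		return 0;
	}
	return 1;
}

/* Return a cached VMA containing addr, or NULL on a miss. */
static struct toy_vma *toy_cache_find(const struct toy_mm *mm,
				      struct toy_task *t, unsigned long addr)
{
	int i;

	if (!toy_cache_valid(mm, t))
		return NULL;
	for (i = 0; i < TOY_CACHE_SIZE; i++) {
		struct toy_vma *vma = t->slots[i];

		if (vma && vma->vm_start <= addr && addr < vma->vm_end)
			return vma;
	}
	return NULL;
}

/* After a miss, store vma in the slot chosen by the page number of addr. */
static void toy_cache_update(const struct toy_mm *mm, struct toy_task *t,
			     unsigned long addr, struct toy_vma *vma)
{
	(void)toy_cache_valid(mm, t);          /* drop stale entries first */
	t->slots[(addr >> TOY_PAGE_SHIFT) & TOY_CACHE_MASK] = vma;
}
```

In this model, invalidation costs a single increment regardless of how many tasks share the address space; each task pays for clearing its own slots only when it next consults its cache.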
/*
* destroy a VMA record
*/
static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
{
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
2012-10-09 03:28:54 +04:00
if (vma->vm_file)
2009-01-08 15:04:47 +03:00
fput(vma->vm_file);
put_nommu_region(vma->vm_region);
kmem_cache_free(vm_area_cachep, vma);
}
/*
* look up the first VMA in which addr resides, NULL if none
* - should be called with mm->mmap_sem at least held readlocked
*/
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
struct vm_area_struct *vma;
/* check the cache first */
2014-04-08 02:37:25 +04:00
vma = vmacache_find(mm, addr);
if (likely(vma))
2009-01-08 15:04:47 +03:00
return vma;
2011-05-25 04:11:24 +04:00
/* trawl the list (there may be multiple mappings in which addr
2009-01-08 15:04:47 +03:00
* resides) */
2011-05-25 04:11:24 +04:00
for (vma = mm->mmap; vma; vma = vma->vm_next) {
2009-01-08 15:04:47 +03:00
if (vma->vm_start > addr)
return NULL;
if (vma->vm_end > addr) {
2014-04-08 02:37:25 +04:00
vmacache_update(addr, vma);
2009-01-08 15:04:47 +03:00
return vma;
}
}
return NULL;
}
EXPORT_SYMBOL(find_vma);
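The list walk in find_vma() depends on the list being sorted by start address: as soon as a VMA starts above addr, no later entry can contain it, so the search stops early. A minimal userspace sketch of just that walk follows, with the cache lookup left out; `struct toy_vma` and `toy_find_vma()` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct; [vm_start, vm_end). */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	struct toy_vma *vm_next;
};

/*
 * Return the first VMA on the sorted list that contains addr, or NULL.
 * Overlapping mappings are allowed, so an entry whose range ends at or
 * below addr is skipped rather than treated as the end of the search.
 */
static struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
	struct toy_vma *vma;

	for (vma = head; vma; vma = vma->vm_next) {
		if (vma->vm_start > addr)
			return NULL;    /* sorted list: nothing later can match */
		if (vma->vm_end > addr)
			return vma;
	}
	return NULL;
}
```

For a list holding [0x1000, 0x2000), [0x1000, 0x3000) and [0x5000, 0x6000), an address of 0x2800 skips the first entry and matches the second, while 0x4000 stops at the third entry without a match.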
/*
* find a VMA
* - we don't extend stack VMAs under NOMMU conditions
*/
struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr)
{
2010-03-25 19:48:38 +03:00
return find_vma(mm, addr);
2009-01-08 15:04:47 +03:00
}
/*
* expand a stack to a given address
* - not supported under NOMMU conditions
*/
int expand_stack(struct vm_area_struct *vma, unsigned long address)
{
return -ENOMEM;
}
/*
* look up the first VMA that exactly matches addr
* - should be called with mm->mmap_sem at least held readlocked
*/
static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
unsigned long addr,
unsigned long len)
{
struct vm_area_struct *vma;
unsigned long end = addr + len;
/* check the cache first */
2014-04-08 02:37:25 +04:00
vma = vmacache_find_exact ( mm , addr , end ) ;
if ( vma )
2009-01-08 15:04:47 +03:00
return vma ;
2011-05-25 04:11:24 +04:00
/* trawl the list (there may be multiple mappings in which addr
2009-01-08 15:04:47 +03:00
* resides ) */
2011-05-25 04:11:24 +04:00
for ( vma = mm - > mmap ; vma ; vma = vma - > vm_next ) {
2009-01-08 15:04:47 +03:00
if ( vma - > vm_start < addr )
continue ;
if ( vma - > vm_start > addr )
return NULL ;
if ( vma - > vm_end = = end ) {
2014-04-08 02:37:25 +04:00
vmacache_update ( addr , vma ) ;
2009-01-08 15:04:47 +03:00
return vma ;
}
}
return NULL ;
2005-04-17 02:20:36 +04:00
}
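/*
 * Illustration only, not part of mm/nommu.c: a minimal userspace sketch of
 * the exact-match walk that find_vma_exact() performs once the cache lookup
 * has missed. struct demo_vma and demo_find_exact() are hypothetical names
 * made up for this sketch. It assumes the list is sorted by ascending
 * vm_start, which is what allows the walk to stop at the first entry that
 * starts above addr.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct; only what the walk needs. */
struct demo_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	struct demo_vma *vm_next;
};

/* Return the first entry covering exactly [addr, addr + len), or NULL. */
static struct demo_vma *demo_find_exact(struct demo_vma *head,
					unsigned long addr,
					unsigned long len)
{
	unsigned long end = addr + len;
	struct demo_vma *vma;

	for (vma = head; vma; vma = vma->vm_next) {
		if (vma->vm_start < addr)
			continue;	/* not reached addr yet */
		if (vma->vm_start > addr)
			return NULL;	/* sorted list: no later entry can match */
		if (vma->vm_end == end)
			return vma;	/* same start and same end */
	}
	return NULL;
}

/* Usage: two mappings start at 0x2000, only one of them ends at 0x2800. */
static void demo_find_exact_selftest(void)
{
	struct demo_vma c = { 0x3000, 0x4000, NULL };
	struct demo_vma b = { 0x2000, 0x2800, &c };
	struct demo_vma a = { 0x2000, 0x3000, &b };

	assert(demo_find_exact(&a, 0x2000, 0x800) == &b);
	assert(demo_find_exact(&a, 0x2000, 0x400) == NULL);
	assert(demo_find_exact(&a, 0x2800, 0x800) == NULL);
}
```

/*
 * Several entries may share a start address, so the walk keeps going past a
 * start match whose end differs instead of giving up on the first one.
 */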
/*
* determine whether a mapping should be permitted and , if so , what sort of
* mapping we ' re capable of supporting
*/
static int validate_mmap_request ( struct file * file ,
unsigned long addr ,
unsigned long len ,
unsigned long prot ,
unsigned long flags ,
unsigned long pgoff ,
unsigned long * _capabilities )
{
2009-01-08 15:04:47 +03:00
unsigned long capabilities , rlen ;
2005-04-17 02:20:36 +04:00
int ret ;
/* do the simple checks first */
2015-06-25 02:57:47 +03:00
if ( flags & MAP_FIXED )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
if ( ( flags & MAP_TYPE ) ! = MAP_PRIVATE & &
( flags & MAP_TYPE ) ! = MAP_SHARED )
return - EINVAL ;
2006-12-06 05:02:59 +03:00
if ( ! len )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
2006-12-06 05:02:59 +03:00
/* Careful about overflows.. */
2009-01-08 15:04:47 +03:00
rlen = PAGE_ALIGN ( len ) ;
if ( ! rlen | | rlen > TASK_SIZE )
2006-12-06 05:02:59 +03:00
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
/* offset overflow? */
2009-01-08 15:04:47 +03:00
if ( ( pgoff + ( rlen > > PAGE_SHIFT ) ) < pgoff )
2006-12-06 05:02:59 +03:00
return - EOVERFLOW ;
2005-04-17 02:20:36 +04:00
if ( file ) {
/* files must support mmap */
2013-09-23 00:27:52 +04:00
if ( ! file - > f_op - > mmap )
2005-04-17 02:20:36 +04:00
return - ENODEV ;
/* work out if what we've got could possibly be shared
* - we support chardevs that provide their own " memory "
* - we support files / blockdevs that are memory backed
*/
2015-01-14 12:42:32 +03:00
if ( file - > f_op - > mmap_capabilities ) {
capabilities = file - > f_op - > mmap_capabilities ( file ) ;
} else {
2005-04-17 02:20:36 +04:00
/* no explicit capabilities set, so assume some
* defaults */
2013-01-24 02:07:38 +04:00
switch ( file_inode ( file ) - > i_mode & S_IFMT ) {
2005-04-17 02:20:36 +04:00
case S_IFREG :
case S_IFBLK :
2015-01-14 12:42:32 +03:00
capabilities = NOMMU_MAP_COPY ;
2005-04-17 02:20:36 +04:00
break ;
case S_IFCHR :
capabilities =
2015-01-14 12:42:32 +03:00
NOMMU_MAP_DIRECT |
NOMMU_MAP_READ |
NOMMU_MAP_WRITE ;
2005-04-17 02:20:36 +04:00
break ;
default :
return - EINVAL ;
}
}
/* eliminate any capabilities that we can't support on this
* device */
if ( ! file - > f_op - > get_unmapped_area )
2015-01-14 12:42:32 +03:00
capabilities & = ~ NOMMU_MAP_DIRECT ;
2015-03-31 19:35:13 +03:00
if ( ! ( file - > f_mode & FMODE_CAN_READ ) )
2015-01-14 12:42:32 +03:00
capabilities & = ~ NOMMU_MAP_COPY ;
2005-04-17 02:20:36 +04:00
2009-08-19 01:11:17 +04:00
/* The file shall have been opened with read permission. */
if ( ! ( file - > f_mode & FMODE_READ ) )
return - EACCES ;
2005-04-17 02:20:36 +04:00
if ( flags & MAP_SHARED ) {
/* do checks for writing, appending and locking */
if ( ( prot & PROT_WRITE ) & &
! ( file - > f_mode & FMODE_WRITE ) )
return - EACCES ;
2013-01-24 02:07:38 +04:00
if ( IS_APPEND ( file_inode ( file ) ) & &
2005-04-17 02:20:36 +04:00
( file - > f_mode & FMODE_WRITE ) )
return - EACCES ;
2014-03-10 17:54:15 +04:00
if ( locks_verify_locked ( file ) )
2005-04-17 02:20:36 +04:00
return - EAGAIN ;
2015-01-14 12:42:32 +03:00
if ( ! ( capabilities & NOMMU_MAP_DIRECT ) )
2005-04-17 02:20:36 +04:00
return - ENODEV ;
/* we mustn't privatise shared mappings */
2015-01-14 12:42:32 +03:00
capabilities & = ~ NOMMU_MAP_COPY ;
2014-04-08 02:37:36 +04:00
} else {
2005-04-17 02:20:36 +04:00
/* we're going to read the file into private memory we
* allocate */
2015-01-14 12:42:32 +03:00
if ( ! ( capabilities & NOMMU_MAP_COPY ) )
2005-04-17 02:20:36 +04:00
return - ENODEV ;
/* we don't permit a private writable mapping to be
* shared with the backing device */
if ( prot & PROT_WRITE )
2015-01-14 12:42:32 +03:00
capabilities & = ~ NOMMU_MAP_DIRECT ;
2005-04-17 02:20:36 +04:00
}
2015-01-14 12:42:32 +03:00
if ( capabilities & NOMMU_MAP_DIRECT ) {
if ( ( ( prot & PROT_READ ) & & ! ( capabilities & NOMMU_MAP_READ ) ) | |
( ( prot & PROT_WRITE ) & & ! ( capabilities & NOMMU_MAP_WRITE ) ) | |
( ( prot & PROT_EXEC ) & & ! ( capabilities & NOMMU_MAP_EXEC ) )
2010-05-26 10:43:00 +04:00
) {
2015-01-14 12:42:32 +03:00
capabilities & = ~ NOMMU_MAP_DIRECT ;
2010-05-26 10:43:00 +04:00
if ( flags & MAP_SHARED ) {
2015-06-25 02:57:47 +03:00
pr_warn ( " MAP_SHARED not completely supported on !MMU \n " ) ;
2010-05-26 10:43:00 +04:00
return - EINVAL ;
}
}
}
2005-04-17 02:20:36 +04:00
/* handle executable mappings and implied executable
* mappings */
2015-06-29 22:42:03 +03:00
if ( path_noexec ( & file - > f_path ) ) {
2005-04-17 02:20:36 +04:00
if ( prot & PROT_EXEC )
return - EPERM ;
2014-04-08 02:37:36 +04:00
} else if ( ( prot & PROT_READ ) & & ! ( prot & PROT_EXEC ) ) {
2005-04-17 02:20:36 +04:00
/* handle implication of PROT_EXEC by PROT_READ */
if ( current - > personality & READ_IMPLIES_EXEC ) {
2015-01-14 12:42:32 +03:00
if ( capabilities & NOMMU_MAP_EXEC )
2005-04-17 02:20:36 +04:00
prot | = PROT_EXEC ;
}
2014-04-08 02:37:36 +04:00
} else if ( ( prot & PROT_READ ) & &
2005-04-17 02:20:36 +04:00
( prot & PROT_EXEC ) & &
2015-01-14 12:42:32 +03:00
! ( capabilities & NOMMU_MAP_EXEC )
2005-04-17 02:20:36 +04:00
) {
/* backing file is not executable, try to copy */
2015-01-14 12:42:32 +03:00
capabilities & = ~ NOMMU_MAP_DIRECT ;
2005-04-17 02:20:36 +04:00
}
2014-04-08 02:37:36 +04:00
} else {
2005-04-17 02:20:36 +04:00
/* anonymous mappings are always memory backed and can be
* privately mapped
*/
2015-01-14 12:42:32 +03:00
capabilities = NOMMU_MAP_COPY ;
2005-04-17 02:20:36 +04:00
/* handle PROT_EXEC implication by PROT_READ */
if ( ( prot & PROT_READ ) & &
( current - > personality & READ_IMPLIES_EXEC ) )
prot | = PROT_EXEC ;
}
/* allow the security API to have its say */
2012-05-30 21:30:51 +04:00
ret = security_mmap_addr ( addr ) ;
2005-04-17 02:20:36 +04:00
if ( ret < 0 )
return ret ;
/* looks okay */
* _capabilities = capabilities ;
return 0 ;
}
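/*
 * Illustration only, not part of mm/nommu.c: a userspace sketch of the
 * length and offset overflow checks made at the top of
 * validate_mmap_request(). The DEMO_* constants and demo_* functions are
 * assumptions made up for this sketch; the kernel takes PAGE_SHIFT and
 * TASK_SIZE from the architecture headers.
 */

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Assumed values for illustration only. */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1UL << DEMO_PAGE_SHIFT)
#define DEMO_TASK_SIZE	(ULONG_MAX / 2)

/* Round len up to a page multiple; wraps to 0 when len is within a page of
 * ULONG_MAX. */
static unsigned long demo_page_align(unsigned long len)
{
	return (len + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/* Mirror the order of the checks: empty length, aligned length, then
 * whether the page offset plus the page count wraps around. */
static int demo_check_range(unsigned long len, unsigned long pgoff)
{
	unsigned long rlen;

	if (!len)
		return -EINVAL;

	rlen = demo_page_align(len);
	if (!rlen || rlen > DEMO_TASK_SIZE)
		return -ENOMEM;

	if (pgoff + (rlen >> DEMO_PAGE_SHIFT) < pgoff)
		return -EOVERFLOW;

	return 0;
}

/* Usage: one passing request and one example of each rejection. */
static void demo_check_range_selftest(void)
{
	assert(demo_check_range(1, 0) == 0);
	assert(demo_check_range(0, 0) == -EINVAL);
	assert(demo_check_range(ULONG_MAX, 0) == -ENOMEM);
	assert(demo_check_range(DEMO_PAGE_SIZE, ULONG_MAX) == -EOVERFLOW);
}
```

/*
 * The !rlen test is what catches a length so large that rounding it up to a
 * page boundary wraps to zero.
 */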
/*
* we ' ve determined that we can make the mapping , now translate what we
* now know into VMA flags
*/
static unsigned long determine_vm_flags ( struct file * file ,
unsigned long prot ,
unsigned long flags ,
unsigned long capabilities )
{
unsigned long vm_flags ;
vm_flags = calc_vm_prot_bits ( prot ) | calc_vm_flag_bits ( flags ) ;
/* vm_flags |= mm->def_flags; */
2015-01-14 12:42:32 +03:00
if ( ! ( capabilities & NOMMU_MAP_DIRECT ) ) {
2005-04-17 02:20:36 +04:00
/* attempt to share read-only copies of mapped file chunks */
2010-05-26 10:43:00 +04:00
vm_flags | = VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC ;
2005-04-17 02:20:36 +04:00
if ( file & & ! ( prot & PROT_WRITE ) )
vm_flags | = VM_MAYSHARE ;
2010-05-26 10:43:00 +04:00
} else {
2005-04-17 02:20:36 +04:00
/* overlay a shareable mapping on the backing device or inode
* if possible - used for chardevs , ramfs / tmpfs / shmfs and
* romfs / cramfs */
2015-01-14 12:42:32 +03:00
vm_flags | = VM_MAYSHARE | ( capabilities & NOMMU_VMFLAGS ) ;
2005-04-17 02:20:36 +04:00
if ( flags & MAP_SHARED )
2010-05-26 10:43:00 +04:00
vm_flags | = VM_SHARED ;
2005-04-17 02:20:36 +04:00
}
/* refuse to let anyone share private mappings with this process if
* it ' s being traced - otherwise breakpoints set in it may interfere
* with another untraced process
*/
2011-06-17 18:50:37 +04:00
if ( ( flags & MAP_PRIVATE ) & & current - > ptrace )
2005-04-17 02:20:36 +04:00
vm_flags & = ~ VM_MAYSHARE ;
return vm_flags ;
}
/*
2009-01-08 15:04:47 +03:00
* set up a shared mapping on a file ( the driver or filesystem provides and
* pins the storage )
2005-04-17 02:20:36 +04:00
*/
2009-01-08 15:04:47 +03:00
static int do_mmap_shared_file ( struct vm_area_struct * vma )
2005-04-17 02:20:36 +04:00
{
int ret ;
ret = vma - > vm_file - > f_op - > mmap ( vma - > vm_file , vma ) ;
2009-01-08 15:04:47 +03:00
if ( ret = = 0 ) {
vma - > vm_region - > vm_top = vma - > vm_region - > vm_end ;
NOMMU: Fix MAP_PRIVATE mmap() of objects where the data can be mapped directly
Fix MAP_PRIVATE mmap() of files and devices where the data in the backing store
might be mapped directly. Use the BDI_CAP_MAP_DIRECT capability flag to govern
whether or not we should be trying to map a file directly. This can be used to
determine whether or not a region has been filled in at the point where we call
do_mmap_shared() or do_mmap_private().
The BDI_CAP_MAP_DIRECT capability flag is cleared by validate_mmap_request() if
there's any reason we can't use it. It's also cleared in do_mmap_pgoff() if
f_op->get_unmapped_area() fails.
Without this fix, attempting to run a program from a RomFS image on a
non-mappable MTD partition results in a BUG as the kernel attempts XIP, and
this can be caught in gdb:
Program received signal SIGABRT, Aborted.
0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
(gdb) bt
#0 0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
#1 0xc005f168 in do_mmap_pgoff (file=0xc31a6620, addr=<value optimized out>, len=3808, prot=3, flags=6146, pgoff=0) at mm/nommu.c:1373
#2 0xc00a96b8 in elf_fdpic_map_file (params=0xc33fbbec, file=0xc31a6620, mm=0xc31bef60, what=0xc0213144 "executable") at mm.h:1145
#3 0xc00aa8b4 in load_elf_fdpic_binary (bprm=0xc316cb00, regs=<value optimized out>) at fs/binfmt_elf_fdpic.c:343
#4 0xc006b588 in search_binary_handler (bprm=0x6, regs=0xc33fbce0) at fs/exec.c:1234
#5 0xc006c648 in do_execve (filename=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460, regs=0xc33fbce0) at fs/exec.c:1356
#6 0xc0008cf0 in sys_execve (name=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460) at arch/frv/kernel/process.c:263
#7 0xc00075dc in __syscall_call () at arch/frv/kernel/entry.S:897
Note that this fix does the following commit differently:
commit a190887b58c32d19c2eee007c5eb8faa970a69ba
Author: David Howells <dhowells@redhat.com>
Date: Sat Sep 5 11:17:07 2009 -0700
nommu: fix error handling in do_mmap_pgoff()
Reported-by: Graff Yang <graff.yang@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 18:13:10 +04:00
return 0 ;
2009-01-08 15:04:47 +03:00
}
2005-04-17 02:20:36 +04:00
if ( ret ! = - ENOSYS )
return ret ;
2010-03-23 23:35:21 +03:00
/* getting -ENOSYS indicates that direct mmap isn't possible (as
* opposed to tried but failed ) so we can only give a suitable error as
* it ' s not possible to make a private copy if MAP_SHARED was given */
2005-04-17 02:20:36 +04:00
return - ENODEV ;
}
/*
* set up a private mapping or an anonymous shared mapping
*/
2009-01-08 15:04:47 +03:00
static int do_mmap_private ( struct vm_area_struct * vma ,
struct vm_region * region ,
2009-09-24 18:13:10 +04:00
unsigned long len ,
unsigned long capabilities )
2005-04-17 02:20:36 +04:00
{
2014-12-13 03:55:55 +03:00
unsigned long total , point ;
2005-04-17 02:20:36 +04:00
void * base ;
2009-01-08 15:04:47 +03:00
int ret , order ;
2005-04-17 02:20:36 +04:00
/* invoke the file's mapping function so that it can keep track of
* shared mappings on devices or memory
* - VM_MAYSHARE will be set if it may attempt to share
*/
2015-01-14 12:42:32 +03:00
if ( capabilities & NOMMU_MAP_DIRECT ) {
2005-04-17 02:20:36 +04:00
ret = vma - > vm_file - > f_op - > mmap ( vma - > vm_file , vma ) ;
2009-01-08 15:04:47 +03:00
if ( ret = = 0 ) {
2005-04-17 02:20:36 +04:00
/* shouldn't return success if we're not sharing */
2009-01-08 15:04:47 +03:00
BUG_ON ( ! ( vma - > vm_flags & VM_MAYSHARE ) ) ;
vma - > vm_region - > vm_top = vma - > vm_region - > vm_end ;
2009-09-24 18:13:10 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2009-01-08 15:04:47 +03:00
if ( ret ! = - ENOSYS )
return ret ;
2005-04-17 02:20:36 +04:00
/* getting an ENOSYS error indicates that direct mmap isn't
* possible ( as opposed to tried but failed ) so we ' ll try to
* make a private copy of the data and map that instead */
}
2009-01-08 15:04:47 +03:00
2005-04-17 02:20:36 +04:00
/* allocate some memory to hold the mapping
* - note that this may not return a page - aligned address if the object
* we ' re allocating is smaller than a page
*/
2011-05-25 04:12:56 +04:00
order = get_order ( len ) ;
2009-01-08 15:04:47 +03:00
total = 1 < < order ;
2011-05-25 04:12:56 +04:00
point = len > > PAGE_SHIFT ;
2009-01-08 15:04:47 +03:00
2014-12-13 03:55:55 +03:00
/* don't keep a whole power-of-2 sized page set if enough of it would go unused */
2015-06-25 02:57:47 +03:00
if ( sysctl_nr_trim_pages & & total - point > = sysctl_nr_trim_pages )
2014-12-13 03:55:55 +03:00
total = point ;
2009-01-08 15:04:47 +03:00
2015-02-28 02:51:43 +03:00
base = alloc_pages_exact ( total < < PAGE_SHIFT , GFP_KERNEL ) ;
2014-12-13 03:55:55 +03:00
if ( ! base )
goto enomem ;
atomic_long_add ( total , & mmap_pages_allocated ) ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
region - > vm_flags = vma - > vm_flags | = VM_MAPPED_COPY ;
region - > vm_start = ( unsigned long ) base ;
2011-05-25 04:12:56 +04:00
region - > vm_end = region - > vm_start + len ;
2009-01-08 15:04:47 +03:00
region - > vm_top = region - > vm_start + ( total < < PAGE_SHIFT ) ;
2009-01-08 15:04:47 +03:00
vma - > vm_start = region - > vm_start ;
vma - > vm_end = region - > vm_start + len ;
2005-04-17 02:20:36 +04:00
if ( vma - > vm_file ) {
/* read the contents of a file into the copy */
mm_segment_t old_fs ;
loff_t fpos ;
fpos = vma - > vm_pgoff ;
fpos < < = PAGE_SHIFT ;
old_fs = get_fs ( ) ;
set_fs ( KERNEL_DS ) ;
2015-03-31 19:35:13 +03:00
ret = __vfs_read ( vma - > vm_file , base , len , & fpos ) ;
2005-04-17 02:20:36 +04:00
set_fs ( old_fs ) ;
if ( ret < 0 )
goto error_free ;
/* zero the tail of the buffer that the read did not fill */
2011-05-25 04:12:56 +04:00
if ( ret < len )
memset ( base + ret , 0 , len - ret ) ;
2005-04-17 02:20:36 +04:00
}
return 0 ;
error_free :
2011-05-25 04:11:26 +04:00
free_page_series ( region - > vm_start , region - > vm_top ) ;
2009-01-08 15:04:47 +03:00
region - > vm_start = vma - > vm_start = 0 ;
region - > vm_end = vma - > vm_end = 0 ;
2009-01-08 15:04:47 +03:00
region - > vm_top = 0 ;
2005-04-17 02:20:36 +04:00
return ret ;
enomem :
2014-06-07 01:38:30 +04:00
pr_err ( " Allocation of length %lu from process %d (%s) failed \n " ,
2009-01-13 10:30:22 +03:00
len , current - > pid , current - > comm ) ;
2011-05-25 04:11:16 +04:00
show_free_areas ( 0 ) ;
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
}
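/*
 * Illustration only, not part of mm/nommu.c: a userspace sketch of how
 * do_mmap_private() sizes its allocation. DEMO_PAGE_SHIFT, demo_get_order()
 * and demo_pages_to_allocate() are hypothetical names made up for this
 * sketch, and nr_trim_pages stands in for sysctl_nr_trim_pages. It assumes
 * len is non-zero and already page aligned, as it is after the PAGE_ALIGN()
 * in do_mmap().
 */

```c
#include <assert.h>

/* Assumed value for illustration only. */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1UL << DEMO_PAGE_SHIFT)

/* Smallest order such that (1 << order) pages hold len bytes. */
static unsigned int demo_get_order(unsigned long len)
{
	unsigned long pages = (len + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
	unsigned int order = 0;

	assert(len > 0);
	while ((1UL << order) < pages)
		order++;
	return order;
}

/* Number of pages to allocate for a len-byte private mapping. */
static unsigned long demo_pages_to_allocate(unsigned long len,
					    unsigned long nr_trim_pages)
{
	unsigned long total = 1UL << demo_get_order(len);	/* power of 2 */
	unsigned long point = len >> DEMO_PAGE_SHIFT;		/* pages needed */

	/* Trim only when enabled and the excess reaches the threshold. */
	if (nr_trim_pages && total - point >= nr_trim_pages)
		total = point;
	return total;
}

/* Usage: a 5-page request rounds up to 8 pages unless trimming applies. */
static void demo_trim_selftest(void)
{
	unsigned long len = 5 * DEMO_PAGE_SIZE;

	assert(demo_pages_to_allocate(len, 0) == 8);	/* trimming disabled */
	assert(demo_pages_to_allocate(len, 1) == 5);	/* excess 3 >= 1 */
	assert(demo_pages_to_allocate(len, 4) == 8);	/* excess 3 < 4 */
}
```

/*
 * A threshold of zero disables trimming entirely, so the full power-of-2
 * page set is kept in that case.
 */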
/*
* handle mapping creation for uClinux
*/
2015-09-10 01:39:29 +03:00
unsigned long do_mmap ( struct file * file ,
unsigned long addr ,
unsigned long len ,
unsigned long prot ,
unsigned long flags ,
vm_flags_t vm_flags ,
unsigned long pgoff ,
unsigned long * populate )
2005-04-17 02:20:36 +04:00
{
2009-01-08 15:04:47 +03:00
struct vm_area_struct * vma ;
struct vm_region * region ;
2005-04-17 02:20:36 +04:00
struct rb_node * rb ;
2015-09-10 01:39:29 +03:00
unsigned long capabilities , result ;
2005-04-17 02:20:36 +04:00
int ret ;
2013-02-23 04:32:47 +04:00
* populate = 0 ;
2013-02-23 04:32:37 +04:00
2005-04-17 02:20:36 +04:00
/* decide whether we should attempt the mapping, and if so what sort of
* mapping */
ret = validate_mmap_request ( file , addr , len , prot , flags , pgoff ,
& capabilities ) ;
2015-06-25 02:57:47 +03:00
if ( ret < 0 )
2005-04-17 02:20:36 +04:00
return ret ;
2009-09-24 15:33:48 +04:00
/* we ignore the address hint */
addr = 0 ;
2011-05-25 04:12:56 +04:00
len = PAGE_ALIGN ( len ) ;
2009-09-24 15:33:48 +04:00
2005-04-17 02:20:36 +04:00
/* we've determined that we can make the mapping, now translate what we
* now know into VMA flags */
2015-09-10 01:39:29 +03:00
vm_flags | = determine_vm_flags ( file , prot , flags , capabilities ) ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
/* we're going to need to record the mapping */
region = kmem_cache_zalloc ( vm_region_jar , GFP_KERNEL ) ;
if ( ! region )
goto error_getting_region ;
vma = kmem_cache_zalloc ( vm_area_cachep , GFP_KERNEL ) ;
if ( ! vma )
goto error_getting_vma ;
2005-04-17 02:20:36 +04:00
2010-01-16 04:01:33 +03:00
region - > vm_usage = 1 ;
2009-01-08 15:04:47 +03:00
region - > vm_flags = vm_flags ;
region - > vm_pgoff = pgoff ;
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
INIT_LIST_HEAD ( & vma - > anon_vma_chain ) ;
2009-01-08 15:04:47 +03:00
vma - > vm_flags = vm_flags ;
vma - > vm_pgoff = pgoff ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
if ( file ) {
2012-08-27 22:48:26 +04:00
region - > vm_file = get_file ( file ) ;
vma - > vm_file = get_file ( file ) ;
2009-01-08 15:04:47 +03:00
}
down_write ( & nommu_region_sem ) ;
/* if we want to share, we need to check for regions created by other
2005-04-17 02:20:36 +04:00
* mmap ( ) calls that overlap with our proposed mapping
2009-01-08 15:04:47 +03:00
* - we can only share with a superset match on most regular files
2005-04-17 02:20:36 +04:00
* - shared mappings on character devices and memory backed files are
* permitted to overlap inexactly as far as we are concerned for in
* these cases , sharing is handled in the driver or filesystem rather
* than here
*/
if ( vm_flags & VM_MAYSHARE ) {
2009-01-08 15:04:47 +03:00
struct vm_region * pregion ;
unsigned long pglen , rpglen , pgend , rpgend , start ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
pglen = ( len + PAGE_SIZE - 1 ) > > PAGE_SHIFT ;
pgend = pgoff + pglen ;
2007-03-22 11:11:24 +03:00
2009-01-08 15:04:47 +03:00
for ( rb = rb_first ( & nommu_region_tree ) ; rb ; rb = rb_next ( rb ) ) {
pregion = rb_entry ( rb , struct vm_region , vm_rb ) ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
if ( ! ( pregion - > vm_flags & VM_MAYSHARE ) )
2005-04-17 02:20:36 +04:00
continue ;
/* search for overlapping mappings on the same file */
2013-01-24 02:07:38 +04:00
if ( file_inode ( pregion - > vm_file ) ! =
file_inode ( file ) )
2005-04-17 02:20:36 +04:00
continue ;
2009-01-08 15:04:47 +03:00
if ( pregion - > vm_pgoff > = pgend )
2005-04-17 02:20:36 +04:00
continue ;
2009-01-08 15:04:47 +03:00
rpglen = pregion - > vm_end - pregion - > vm_start ;
rpglen = ( rpglen + PAGE_SIZE - 1 ) > > PAGE_SHIFT ;
rpgend = pregion - > vm_pgoff + rpglen ;
if ( pgoff > = rpgend )
2005-04-17 02:20:36 +04:00
continue ;
2009-01-08 15:04:47 +03:00
/* handle inexactly overlapping matches between
* mappings */
if ( ( pregion - > vm_pgoff ! = pgoff | | rpglen ! = pglen ) & &
! ( pgoff > = pregion - > vm_pgoff & & pgend < = rpgend ) ) {
/* new mapping is not a subset of the region */
2015-01-14 12:42:32 +03:00
if ( ! ( capabilities & NOMMU_MAP_DIRECT ) )
2005-04-17 02:20:36 +04:00
goto sharing_violation ;
continue ;
}
2009-01-08 15:04:47 +03:00
/* we've found a region we can share */
2010-01-16 04:01:33 +03:00
pregion - > vm_usage + + ;
2009-01-08 15:04:47 +03:00
vma - > vm_region = pregion ;
start = pregion - > vm_start ;
start + = ( pgoff - pregion - > vm_pgoff ) < < PAGE_SHIFT ;
vma - > vm_start = start ;
vma - > vm_end = start + len ;
2015-06-25 02:57:47 +03:00
if ( pregion - > vm_flags & VM_MAPPED_COPY )
2009-01-08 15:04:47 +03:00
vma - > vm_flags | = VM_MAPPED_COPY ;
2015-06-25 02:57:47 +03:00
else {
2009-01-08 15:04:47 +03:00
ret = do_mmap_shared_file ( vma ) ;
if ( ret < 0 ) {
vma - > vm_region = NULL ;
vma - > vm_start = 0 ;
vma - > vm_end = 0 ;
2010-01-16 04:01:33 +03:00
pregion - > vm_usage - - ;
2009-01-08 15:04:47 +03:00
pregion = NULL ;
goto error_just_free ;
}
}
fput ( region - > vm_file ) ;
kmem_cache_free ( vm_region_jar , region ) ;
region = pregion ;
result = start ;
goto share ;
2005-04-17 02:20:36 +04:00
}
/* obtain the address at which to make a shared mapping
* - this is the hook for quasi - memory character devices to
* tell us the location of a shared mapping
*/
2015-01-14 12:42:32 +03:00
if ( capabilities & NOMMU_MAP_DIRECT ) {
2005-04-17 02:20:36 +04:00
addr = file - > f_op - > get_unmapped_area ( file , addr , len ,
pgoff , flags ) ;
2011-05-25 04:11:27 +04:00
if ( IS_ERR_VALUE ( addr ) ) {
2005-04-17 02:20:36 +04:00
ret = addr ;
2011-05-25 04:11:27 +04:00
if ( ret ! = - ENOSYS )
2009-01-08 15:04:47 +03:00
goto error_just_free ;
2005-04-17 02:20:36 +04:00
/* the driver refused to tell us where to site
* the mapping so we ' ll have to attempt to copy
* it */
2011-05-25 04:11:27 +04:00
ret = - ENODEV ;
2015-01-14 12:42:32 +03:00
if ( ! ( capabilities & NOMMU_MAP_COPY ) )
2009-01-08 15:04:47 +03:00
goto error_just_free ;
2005-04-17 02:20:36 +04:00
2015-01-14 12:42:32 +03:00
capabilities & = ~ NOMMU_MAP_DIRECT ;
2009-01-08 15:04:47 +03:00
} else {
vma - > vm_start = region - > vm_start = addr ;
vma - > vm_end = region - > vm_end = addr + len ;
2005-04-17 02:20:36 +04:00
}
}
}
2009-01-08 15:04:47 +03:00
vma - > vm_region = region ;
2005-04-17 02:20:36 +04:00
NOMMU: Fix MAP_PRIVATE mmap() of objects where the data can be mapped directly
Fix MAP_PRIVATE mmap() of files and devices where the data in the backing store
might be mapped directly. Use the BDI_CAP_MAP_DIRECT capability flag to govern
whether or not we should be trying to map a file directly. This can be used to
determine whether or not a region has been filled in at the point where we call
do_mmap_shared() or do_mmap_private().
The BDI_CAP_MAP_DIRECT capability flag is cleared by validate_mmap_request() if
there's any reason we can't use it. It's also cleared in do_mmap_pgoff() if
f_op->get_unmapped_area() fails.
Without this fix, attempting to run a program from a RomFS image on a
non-mappable MTD partition results in a BUG as the kernel attempts XIP, and
this can be caught in gdb:
Program received signal SIGABRT, Aborted.
0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
(gdb) bt
#0 0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
#1 0xc005f168 in do_mmap_pgoff (file=0xc31a6620, addr=<value optimized out>, len=3808, prot=3, flags=6146, pgoff=0) at mm/nommu.c:1373
#2 0xc00a96b8 in elf_fdpic_map_file (params=0xc33fbbec, file=0xc31a6620, mm=0xc31bef60, what=0xc0213144 "executable") at mm.h:1145
#3 0xc00aa8b4 in load_elf_fdpic_binary (bprm=0xc316cb00, regs=<value optimized out>) at fs/binfmt_elf_fdpic.c:343
#4 0xc006b588 in search_binary_handler (bprm=0x6, regs=0xc33fbce0) at fs/exec.c:1234
#5 0xc006c648 in do_execve (filename=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460, regs=0xc33fbce0) at fs/exec.c:1356
#6 0xc0008cf0 in sys_execve (name=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460) at arch/frv/kernel/process.c:263
#7 0xc00075dc in __syscall_call () at arch/frv/kernel/entry.S:897
Note that this fix does the following commit differently:
commit a190887b58c32d19c2eee007c5eb8faa970a69ba
Author: David Howells <dhowells@redhat.com>
Date: Sat Sep 5 11:17:07 2009 -0700
nommu: fix error handling in do_mmap_pgoff()
Reported-by: Graff Yang <graff.yang@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 18:13:10 +04:00
/* set up the mapping
2015-01-14 12:42:32 +03:00
* - the region is filled in if NOMMU_MAP_DIRECT is still set
2009-09-24 18:13:10 +04:00
*/
2005-04-17 02:20:36 +04:00
if ( file & & vma - > vm_flags & VM_SHARED )
2009-01-08 15:04:47 +03:00
ret = do_mmap_shared_file ( vma ) ;
2005-04-17 02:20:36 +04:00
else
2009-09-24 18:13:10 +04:00
ret = do_mmap_private ( vma , region , len , capabilities ) ;
2005-04-17 02:20:36 +04:00
if ( ret < 0 )
2009-09-24 18:13:10 +04:00
goto error_just_free ;
add_nommu_region ( region ) ;
2009-01-08 15:04:47 +03:00
2009-12-15 05:00:02 +03:00
/* clear anonymous mappings that don't ask for uninitialized data */
if ( ! vma - > vm_file & & ! ( flags & MAP_UNINITIALIZED ) )
memset ( ( void * ) region - > vm_start , 0 ,
region - > vm_end - region - > vm_start ) ;
2005-04-17 02:20:36 +04:00
/* okay... we have a mapping; now we have to register it */
2009-01-08 15:04:47 +03:00
result = vma - > vm_start ;
2005-04-17 02:20:36 +04:00
current - > mm - > total_vm + = len > > PAGE_SHIFT ;
2009-01-08 15:04:47 +03:00
share :
add_vma_to_mm ( current - > mm , vma ) ;
2005-04-17 02:20:36 +04:00
NOMMU: Avoiding duplicate icache flushes of shared maps
When working with FDPIC, there are many shared mappings of read-only
code regions between applications (the C library, applet packages like
busybox, etc.), but the current do_mmap_pgoff() function will issue an
icache flush whenever a VMA is added to an MM instead of only doing it
when the map is initially created.
The flush can instead be done when a region is first mmapped PROT_EXEC.
Note that we may not rely on the first mapping of a region being
executable - it's possible for it to be PROT_READ only, so we have to
remember whether we've flushed the region or not, and then flush the
entire region when a bit of it is made executable.
However, this also affects the brk area. That will no longer be
executable. We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
for NOMMU mode kernels, when it increases the brk allocation, making
sys_brk() flush the extra from the icache should suffice. The brk area
probably isn't used by NOMMU programs since the brk area can only use up
the leavings from the stack allocation, where the stack allocation is
larger than requested.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
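The flush-once behaviour described in the commit message above can be modelled in user space. The following is only a sketch under stated assumptions: struct sketch_region, sketch_flush_icache_range() and sketch_map_region() are hypothetical stand-ins, not kernel APIs, and the stand-in flush merely counts how often it is called.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-region state the kernel keeps. */
struct sketch_region {
	unsigned long start;
	unsigned long end;
	bool icache_flushed;	/* remembers whether the region was flushed */
};

/* Counts calls to the stand-in flush so the behaviour can be observed. */
unsigned int sketch_flush_count;

/* Stand-in for a real icache flush; it only bumps the counter. */
void sketch_flush_icache_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	sketch_flush_count++;
}

/*
 * Called every time a mapping of the region is added. The flush covers the
 * whole region and happens only for the first executable mapping, even if
 * earlier mappings of the same region were read-only.
 */
void sketch_map_region(struct sketch_region *r, bool exec)
{
	if (exec && !r->icache_flushed) {
		sketch_flush_icache_range(r->start, r->end);
		r->icache_flushed = true;
	}
}
```

A read-only mapping followed by two executable mappings of the same region results in a single flush in this model.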
2010-01-06 20:23:23 +03:00
/* we flush the region from the icache only when the first executable
* mapping of it is made */
if ( vma - > vm_flags & VM_EXEC & & ! region - > vm_icache_flushed ) {
flush_icache_range ( region - > vm_start , region - > vm_end ) ;
region - > vm_icache_flushed = true ;
}
2005-04-17 02:20:36 +04:00
2010-01-06 20:23:23 +03:00
up_write ( & nommu_region_sem ) ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
return result ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
error_just_free :
up_write ( & nommu_region_sem ) ;
error :
2009-10-30 16:13:26 +03:00
if ( region - > vm_file )
fput ( region - > vm_file ) ;
2009-01-08 15:04:47 +03:00
kmem_cache_free ( vm_region_jar , region ) ;
2009-10-30 16:13:26 +03:00
if ( vma - > vm_file )
fput ( vma - > vm_file ) ;
2009-01-08 15:04:47 +03:00
kmem_cache_free ( vm_area_cachep , vma ) ;
return ret ;
sharing_violation :
up_write ( & nommu_region_sem ) ;
2015-06-25 02:57:47 +03:00
pr_warn ( " Attempt to share mismatched mappings \n " ) ;
2009-01-08 15:04:47 +03:00
ret = - EINVAL ;
goto error ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
error_getting_vma :
kmem_cache_free ( vm_region_jar , region ) ;
2015-06-25 02:57:47 +03:00
pr_warn ( " Allocation of vma for %lu byte allocation from process %d failed \n " ,
len , current - > pid ) ;
2011-05-25 04:11:16 +04:00
show_free_areas ( 0 ) ;
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
2009-01-08 15:04:47 +03:00
error_getting_region :
2015-06-25 02:57:47 +03:00
pr_warn ( " Allocation of vm region for %lu byte allocation from process %d failed \n " ,
len , current - > pid ) ;
2011-05-25 04:11:16 +04:00
show_free_areas ( 0 ) ;
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
}
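The page-range test that decides whether a new shared mapping can reuse an existing region is easy to lose among the surrounding bookkeeping. A minimal user-space sketch of that arithmetic follows. The names prefixed with sketch_ and the 4 KiB page size are assumptions made for illustration; the kernel takes PAGE_SHIFT from the architecture and works on struct vm_region directly.

```c
/* Hypothetical page geometry; real values come from the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

enum sketch_share {
	SKETCH_NO_OVERLAP,	/* page ranges do not intersect: keep searching */
	SKETCH_SHARE_OK,	/* exact match or subset: the region can be shared */
	SKETCH_MISMATCH,	/* partial overlap: a violation unless mapped directly */
};

/* Round a byte length up to whole pages, as the pglen computation does. */
unsigned long sketch_len_to_pages(unsigned long len)
{
	return (len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Compare the proposed page range against an existing region's page range. */
enum sketch_share sketch_classify(unsigned long pgoff, unsigned long pglen,
				  unsigned long r_pgoff, unsigned long r_pglen)
{
	unsigned long pgend = pgoff + pglen;
	unsigned long rpgend = r_pgoff + r_pglen;

	if (r_pgoff >= pgend || pgoff >= rpgend)
		return SKETCH_NO_OVERLAP;
	/* an exact match is just the largest possible subset */
	if (pgoff >= r_pgoff && pgend <= rpgend)
		return SKETCH_SHARE_OK;
	return SKETCH_MISMATCH;
}

/* Address of the new mapping inside the region that is being shared. */
unsigned long sketch_share_start(unsigned long r_start, unsigned long r_pgoff,
				 unsigned long pgoff)
{
	return r_start + ((pgoff - r_pgoff) << SKETCH_PAGE_SHIFT);
}
```

In this model a one-page request at page offset 3 against a region that starts at 0x100000 with page offset 2 is placed one page into the region.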
2012-04-21 04:13:58 +04:00
2009-12-30 23:17:34 +03:00
SYSCALL_DEFINE6 ( mmap_pgoff , unsigned long , addr , unsigned long , len ,
unsigned long , prot , unsigned long , flags ,
unsigned long , fd , unsigned long , pgoff )
{
struct file * file = NULL ;
unsigned long retval = - EBADF ;
2010-10-30 10:54:44 +04:00
audit_mmap_fd ( fd , flags ) ;
2009-12-30 23:17:34 +03:00
if ( ! ( flags & MAP_ANONYMOUS ) ) {
file = fget ( fd ) ;
if ( ! file )
goto out ;
}
flags & = ~ ( MAP_EXECUTABLE | MAP_DENYWRITE ) ;
2012-06-04 08:29:59 +04:00
retval = vm_mmap_pgoff ( file , addr , len , prot , flags , pgoff ) ;
2009-12-30 23:17:34 +03:00
if ( file )
fput ( file ) ;
out :
return retval ;
}
2010-03-11 02:21:15 +03:00
# ifdef __ARCH_WANT_SYS_OLD_MMAP
struct mmap_arg_struct {
unsigned long addr ;
unsigned long len ;
unsigned long prot ;
unsigned long flags ;
unsigned long fd ;
unsigned long offset ;
} ;
SYSCALL_DEFINE1 ( old_mmap , struct mmap_arg_struct __user * , arg )
{
struct mmap_arg_struct a ;
if ( copy_from_user ( & a , arg , sizeof ( a ) ) )
return - EFAULT ;
2015-11-06 05:46:35 +03:00
if ( offset_in_page ( a . offset ) )
2010-03-11 02:21:15 +03:00
return - EINVAL ;
return sys_mmap_pgoff ( a . addr , a . len , a . prot , a . flags , a . fd ,
a . offset > > PAGE_SHIFT ) ;
}
# endif /* __ARCH_WANT_SYS_OLD_MMAP */
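The old-style mmap entry point takes a byte offset, while mmap_pgoff takes a page offset. The conversion can be sketched in user space as follows. sketch_offset_to_pgoff() and the 4 KiB page size are assumptions for illustration only; the kernel uses offset_in_page() and the architecture's PAGE_SHIFT.

```c
#include <errno.h>

/* Hypothetical page geometry; real values come from the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/*
 * Convert a byte offset into a page offset. Offsets that are not a whole
 * number of pages are rejected with -EINVAL, mirroring the check in
 * old_mmap; on success the page offset is stored in *pgoff.
 */
long sketch_offset_to_pgoff(unsigned long offset, unsigned long *pgoff)
{
	if (offset & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL;
	*pgoff = offset >> SKETCH_PAGE_SHIFT;
	return 0;
}
```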
2005-04-17 02:20:36 +04:00
/*
2009-01-08 15:04:47 +03:00
* split a vma into two pieces at address ' addr ' , a new vma is allocated either
* for the first part or the tail .
2005-04-17 02:20:36 +04:00
*/
2009-01-08 15:04:47 +03:00
int split_vma ( struct mm_struct * mm , struct vm_area_struct * vma ,
unsigned long addr , int new_below )
2005-04-17 02:20:36 +04:00
{
2009-01-08 15:04:47 +03:00
struct vm_area_struct * new ;
struct vm_region * region ;
unsigned long npages ;
2005-04-17 02:20:36 +04:00
2010-01-16 04:01:34 +03:00
/* we're only permitted to split anonymous regions (these should have
* only a single usage on the region ) */
if ( vma - > vm_file )
2009-01-08 15:04:47 +03:00
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
if ( mm - > map_count > = sysctl_max_map_count )
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
region = kmem_cache_alloc ( vm_region_jar , GFP_KERNEL ) ;
if ( ! region )
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
new = kmem_cache_alloc ( vm_area_cachep , GFP_KERNEL ) ;
if ( ! new ) {
kmem_cache_free ( vm_region_jar , region ) ;
return - ENOMEM ;
}
/* most fields are the same, copy all, and then fixup */
* new = * vma ;
* region = * vma - > vm_region ;
new - > vm_region = region ;
npages = ( addr - vma - > vm_start ) > > PAGE_SHIFT ;
if ( new_below ) {
2009-01-08 15:04:47 +03:00
region - > vm_top = region - > vm_end = new - > vm_end = addr ;
2009-01-08 15:04:47 +03:00
} else {
region - > vm_start = new - > vm_start = addr ;
region - > vm_pgoff = new - > vm_pgoff + = npages ;
2005-04-17 02:20:36 +04:00
}
2009-01-08 15:04:47 +03:00
if ( new - > vm_ops & & new - > vm_ops - > open )
new - > vm_ops - > open ( new ) ;
delete_vma_from_mm ( vma ) ;
down_write ( & nommu_region_sem ) ;
delete_nommu_region ( vma - > vm_region ) ;
if ( new_below ) {
vma - > vm_region - > vm_start = vma - > vm_start = addr ;
vma - > vm_region - > vm_pgoff = vma - > vm_pgoff + = npages ;
} else {
vma - > vm_region - > vm_end = vma - > vm_end = addr ;
2009-01-08 15:04:47 +03:00
vma - > vm_region - > vm_top = addr ;
2009-01-08 15:04:47 +03:00
}
add_nommu_region ( vma - > vm_region ) ;
add_nommu_region ( new - > vm_region ) ;
up_write ( & nommu_region_sem ) ;
add_vma_to_mm ( mm , vma ) ;
add_vma_to_mm ( mm , new ) ;
return 0 ;
2005-04-17 02:20:36 +04:00
}
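The effect of new_below on the two halves of a split is easiest to see with the locking and tree updates stripped away. The sketch below models only the address and page-offset arithmetic. struct sketch_range and sketch_split() are hypothetical, and a 4 KiB page size is assumed.

```c
/* Hypothetical page geometry; real values come from the architecture. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for the fields that split_vma adjusts. */
struct sketch_range {
	unsigned long start;
	unsigned long end;
	unsigned long pgoff;
};

/*
 * Split *orig at addr. *new_part starts as a copy of *orig and becomes the
 * lower piece when new_below is non-zero, otherwise the upper piece. The
 * piece that begins at addr has its page offset advanced by the number of
 * pages that now lie below it.
 */
void sketch_split(struct sketch_range *orig, struct sketch_range *new_part,
		  unsigned long addr, int new_below)
{
	unsigned long npages = (addr - orig->start) >> SKETCH_PAGE_SHIFT;

	*new_part = *orig;
	if (new_below) {
		new_part->end = addr;
		orig->start = addr;
		orig->pgoff += npages;
	} else {
		new_part->start = addr;
		new_part->pgoff += npages;
		orig->end = addr;
	}
}
```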
2006-09-27 12:50:20 +04:00
/*
2009-01-08 15:04:47 +03:00
* shrink a VMA by removing the specified chunk from either the beginning or
* the end
2006-09-27 12:50:20 +04:00
*/
2009-01-08 15:04:47 +03:00
static int shrink_vma ( struct mm_struct * mm ,
struct vm_area_struct * vma ,
unsigned long from , unsigned long to )
2005-04-17 02:20:36 +04:00
{
2009-01-08 15:04:47 +03:00
struct vm_region * region ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
/* adjust the VMA's pointers, which may reposition it in the MM's tree
* and list */
delete_vma_from_mm ( vma ) ;
if ( from > vma - > vm_start )
vma - > vm_end = from ;
else
vma - > vm_start = to ;
add_vma_to_mm ( mm , vma ) ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
/* cut the backing region down to size */
region = vma - > vm_region ;
2010-01-16 04:01:33 +03:00
BUG_ON ( region - > vm_usage ! = 1 ) ;
2009-01-08 15:04:47 +03:00
down_write ( & nommu_region_sem ) ;
delete_nommu_region ( region ) ;
2009-01-08 15:04:47 +03:00
if ( from > region - > vm_start ) {
to = region - > vm_top ;
region - > vm_top = region - > vm_end = from ;
} else {
2009-01-08 15:04:47 +03:00
region - > vm_start = to ;
2009-01-08 15:04:47 +03:00
}
2009-01-08 15:04:47 +03:00
add_nommu_region ( region ) ;
up_write ( & nommu_region_sem ) ;
free_page_series ( from , to ) ;
return 0 ;
}
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
/*
* release a mapping
* - under NOMMU conditions the chunk to be unmapped must be backed by a single
* VMA , though it need not cover the whole VMA
*/
int do_munmap ( struct mm_struct * mm , unsigned long start , size_t len )
{
struct vm_area_struct * vma ;
2011-05-25 04:12:56 +04:00
unsigned long end ;
2009-01-08 15:04:47 +03:00
int ret ;
2005-04-17 02:20:36 +04:00
2011-05-25 04:12:56 +04:00
len = PAGE_ALIGN ( len ) ;
2009-01-08 15:04:47 +03:00
if ( len = = 0 )
return - EINVAL ;
[PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
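The convention described in the commit message above (update the high-water mark only just before lowering the counter, and let the reader take the maximum) can be sketched as follows. struct sketch_mm and the sketch_ helpers are hypothetical user-space stand-ins, not the kernel's macros.

```c
/* Hypothetical stand-in for the two counters involved. */
struct sketch_mm {
	unsigned long total_vm;
	unsigned long hiwater_vm;
};

/* Record the current value as the peak if it exceeds the stored peak. */
void sketch_update_hiwater_vm(struct sketch_mm *mm)
{
	if (mm->hiwater_vm < mm->total_vm)
		mm->hiwater_vm = mm->total_vm;
}

/* Lowering the counter is the only point where the peak is refreshed. */
void sketch_shrink_vm(struct sketch_mm *mm, unsigned long pages)
{
	sketch_update_hiwater_vm(mm);
	mm->total_vm -= pages;
}

/* A reader must take the maximum itself, since the peak may be stale. */
unsigned long sketch_peak_vm(const struct sketch_mm *mm)
{
	return mm->hiwater_vm > mm->total_vm ? mm->hiwater_vm : mm->total_vm;
}
```

Raising total_vm stays cheap because it never touches hiwater_vm; the cost moves to the rarer shrink path and to whoever reads the statistic.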
2005-10-30 04:16:18 +03:00
2011-05-25 04:12:56 +04:00
end = start + len ;
2009-01-08 15:04:47 +03:00
/* find the first potentially overlapping VMA */
vma = find_vma ( mm , start ) ;
if ( ! vma ) {
2014-04-08 02:37:36 +04:00
static int limit ;
2009-04-03 03:56:32 +04:00
if ( limit < 5 ) {
2015-06-25 02:57:47 +03:00
pr_warn ( " munmap of memory not mmapped by process %d (%s): 0x%lx-0x%lx \n " ,
current - > pid , current - > comm ,
start , start + len - 1 ) ;
2009-04-03 03:56:32 +04:00
limit + + ;
}
2009-01-08 15:04:47 +03:00
return - EINVAL ;
}
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
/* we're allowed to split an anonymous VMA but not a file-backed one */
if ( vma - > vm_file ) {
do {
2015-06-25 02:57:47 +03:00
if ( start > vma - > vm_start )
2009-01-08 15:04:47 +03:00
return - EINVAL ;
if ( end = = vma - > vm_end )
goto erase_whole_vma ;
2011-05-25 04:11:25 +04:00
vma = vma - > vm_next ;
} while ( vma ) ;
2009-01-08 15:04:47 +03:00
return - EINVAL ;
} else {
/* the chunk must be a subset of the VMA found */
if ( start = = vma - > vm_start & & end = = vma - > vm_end )
goto erase_whole_vma ;
2015-06-25 02:57:47 +03:00
if ( start < vma - > vm_start | | end > vma - > vm_end )
2009-01-08 15:04:47 +03:00
return - EINVAL ;
2015-11-06 05:46:35 +03:00
if ( offset_in_page ( start ) )
2009-01-08 15:04:47 +03:00
return - EINVAL ;
2015-11-06 05:46:35 +03:00
if ( end ! = vma - > vm_end & & offset_in_page ( end ) )
2009-01-08 15:04:47 +03:00
return - EINVAL ;
if ( start ! = vma - > vm_start & & end ! = vma - > vm_end ) {
ret = split_vma ( mm , vma , start , 1 ) ;
2015-06-25 02:57:47 +03:00
if ( ret < 0 )
2009-01-08 15:04:47 +03:00
return ret ;
}
return shrink_vma ( mm , vma , start , end ) ;
}
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
erase_whole_vma :
delete_vma_from_mm ( vma ) ;
delete_vma ( mm , vma ) ;
2005-04-17 02:20:36 +04:00
return 0 ;
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL ( do_munmap ) ;
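For an anonymous VMA, do_munmap reduces to a short decision sequence: erase the whole VMA, shrink it from one end, or split it first and then shrink. The sketch below models only that decision, under stated assumptions: the names are hypothetical, a 4 KiB page size is assumed, and the file-backed path (which walks the VMA list looking for a whole-VMA match) is not modelled.

```c
/* Hypothetical page geometry; real values come from the architecture. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

enum sketch_action {
	SKETCH_ERASE_WHOLE,		/* request covers the VMA exactly */
	SKETCH_SHRINK,			/* chunk touches one end of the VMA */
	SKETCH_SPLIT_THEN_SHRINK,	/* chunk is strictly inside the VMA */
	SKETCH_EINVAL,			/* request is rejected */
};

/* Decide what to do with an anonymous VMA spanning [vm_start, vm_end). */
enum sketch_action sketch_munmap_anon(unsigned long vm_start,
				      unsigned long vm_end,
				      unsigned long start, unsigned long len)
{
	unsigned long end;

	/* round the length up to whole pages; zero is never valid */
	len = (len + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;
	if (len == 0)
		return SKETCH_EINVAL;
	end = start + len;

	if (start == vm_start && end == vm_end)
		return SKETCH_ERASE_WHOLE;
	/* the chunk must be a subset of the VMA */
	if (start < vm_start || end > vm_end)
		return SKETCH_EINVAL;
	if (start & SKETCH_PAGE_MASK)
		return SKETCH_EINVAL;
	if (end != vm_end && (end & SKETCH_PAGE_MASK))
		return SKETCH_EINVAL;
	if (start != vm_start && end != vm_end)
		return SKETCH_SPLIT_THEN_SHRINK;
	return SKETCH_SHRINK;
}
```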
2005-04-17 02:20:36 +04:00
2012-04-21 05:57:04 +04:00
int vm_munmap ( unsigned long addr , size_t len )
2006-09-27 12:50:20 +04:00
{
2012-04-21 05:57:04 +04:00
struct mm_struct * mm = current - > mm ;
2006-09-27 12:50:20 +04:00
int ret ;
down_write ( & mm - > mmap_sem ) ;
ret = do_munmap ( mm , addr , len ) ;
up_write ( & mm - > mmap_sem ) ;
return ret ;
}
2012-04-21 03:20:01 +04:00
EXPORT_SYMBOL ( vm_munmap ) ;
SYSCALL_DEFINE2 ( munmap , unsigned long , addr , size_t , len )
{
2012-04-21 05:57:04 +04:00
return vm_munmap ( addr , len ) ;
2012-04-21 03:20:01 +04:00
}
2006-09-27 12:50:20 +04:00
/*
2009-01-08 15:04:47 +03:00
* release all the mappings made in a process ' s VM space
2006-09-27 12:50:20 +04:00
*/
2009-01-08 15:04:47 +03:00
void exit_mmap ( struct mm_struct * mm )
2005-04-17 02:20:36 +04:00
{
2009-01-08 15:04:47 +03:00
struct vm_area_struct * vma ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
if ( ! mm )
return ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
mm - > total_vm = 0 ;
2005-04-17 02:20:36 +04:00
2009-01-08 15:04:47 +03:00
while ( ( vma = mm - > mmap ) ) {
mm - > mmap = vma - > vm_next ;
delete_vma_from_mm ( vma ) ;
delete_vma ( mm , vma ) ;
2010-11-24 23:56:54 +03:00
cond_resched ( ) ;
2005-04-17 02:20:36 +04:00
}
}
2012-04-21 02:35:40 +04:00
unsigned long vm_brk ( unsigned long addr , unsigned long len )
2005-04-17 02:20:36 +04:00
{
return - ENOMEM ;
}
/*
2006-09-27 12:50:21 +04:00
* expand ( or shrink ) an existing mapping , potentially moving it at the same
* time ( controlled by the MREMAP_MAYMOVE flag and available VM space )
2005-04-17 02:20:36 +04:00
*
2006-09-27 12:50:21 +04:00
* under NOMMU conditions , we only permit changing a mapping ' s size , and only
2009-01-08 15:04:47 +03:00
* as long as it stays within the region allocated by do_mmap_private ( ) and the
* block is not shareable
2005-04-17 02:20:36 +04:00
*
2006-09-27 12:50:21 +04:00
* MREMAP_FIXED is not supported under NOMMU conditions
2005-04-17 02:20:36 +04:00
*/
2013-03-04 19:47:59 +04:00
static unsigned long do_mremap ( unsigned long addr ,
2005-04-17 02:20:36 +04:00
unsigned long old_len , unsigned long new_len ,
unsigned long flags , unsigned long new_addr )
{
2006-09-27 12:50:21 +04:00
struct vm_area_struct * vma ;
2005-04-17 02:20:36 +04:00
/* insanity checks first */
2011-05-25 04:12:56 +04:00
old_len = PAGE_ALIGN ( old_len ) ;
new_len = PAGE_ALIGN ( new_len ) ;
2009-01-08 15:04:47 +03:00
if ( old_len = = 0 | | new_len = = 0 )
2005-04-17 02:20:36 +04:00
return ( unsigned long ) - EINVAL ;
2015-11-06 05:46:35 +03:00
if ( offset_in_page ( addr ) )
2009-01-08 15:04:47 +03:00
return - EINVAL ;
2005-04-17 02:20:36 +04:00
if ( flags & MREMAP_FIXED & & new_addr ! = addr )
return ( unsigned long ) - EINVAL ;
2009-01-08 15:04:47 +03:00
vma = find_vma_exact ( current - > mm , addr , old_len ) ;
2006-09-27 12:50:21 +04:00
if ( ! vma )
return ( unsigned long ) - EINVAL ;
2005-04-17 02:20:36 +04:00
2006-09-27 12:50:21 +04:00
if ( vma - > vm_end ! = vma - > vm_start + old_len )
2005-04-17 02:20:36 +04:00
return ( unsigned long ) - EFAULT ;
2006-09-27 12:50:21 +04:00
if ( vma - > vm_flags & VM_MAYSHARE )
2005-04-17 02:20:36 +04:00
return ( unsigned long ) - EPERM ;
2009-01-08 15:04:47 +03:00
if ( new_len > vma - > vm_region - > vm_end - vma - > vm_region - > vm_start )
2005-04-17 02:20:36 +04:00
return ( unsigned long ) - ENOMEM ;
/* all checks complete - do it */
2006-09-27 12:50:21 +04:00
vma - > vm_end = vma - > vm_start + new_len ;
return vma - > vm_start ;
}
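The checks in do_mremap() above can be modelled in userspace. The following is a minimal sketch, not the kernel implementation: `struct model_vma`, `model_mremap()`, the `MODEL_*` macros and the 4 KiB page size are all hypothetical names and assumptions introduced for illustration. The find_vma_exact() lookup and the redundant -EFAULT length check are collapsed into a single comparison.

```c
#include <errno.h>

/* Assumption for this sketch: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_ALIGN(x) \
	(((x) + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1))

/* Hypothetical stand-in for the MREMAP_FIXED flag bit. */
#define MODEL_MREMAP_FIXED 2UL

/* Hypothetical, simplified stand-in for vm_area_struct plus its vm_region. */
struct model_vma {
	unsigned long start;       /* models vma->vm_start */
	unsigned long end;         /* models vma->vm_end */
	unsigned long region_size; /* models vm_region->vm_end - vm_region->vm_start */
	int may_share;             /* models vma->vm_flags & VM_MAYSHARE */
};

/* Resize in place only; returns the mapping start or a negative errno. */
static long model_mremap(struct model_vma *vma, unsigned long addr,
			 unsigned long old_len, unsigned long new_len,
			 unsigned long flags, unsigned long new_addr)
{
	old_len = MODEL_PAGE_ALIGN(old_len);
	new_len = MODEL_PAGE_ALIGN(new_len);
	if (old_len == 0 || new_len == 0)
		return -EINVAL;
	if (addr & (MODEL_PAGE_SIZE - 1))
		return -EINVAL;
	/* a "fixed" request is tolerated only if it does not move anything */
	if ((flags & MODEL_MREMAP_FIXED) && new_addr != addr)
		return -EINVAL;
	/* models find_vma_exact(): start and length must both match */
	if (vma->start != addr || vma->end != addr + old_len)
		return -EINVAL;
	if (vma->may_share)
		return -EPERM;
	/* the mapping may never outgrow its backing region */
	if (new_len > vma->region_size)
		return -ENOMEM;
	vma->end = vma->start + new_len;
	return (long)vma->start;
}
```

In this model, growing a two-page private mapping to three pages inside a four-page region succeeds and returns the unchanged start address, while asking for five pages fails with -ENOMEM.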
2009-01-14 16:14:15 +03:00
SYSCALL_DEFINE5 ( mremap , unsigned long , addr , unsigned long , old_len ,
unsigned long , new_len , unsigned long , flags ,
unsigned long , new_addr )
2006-09-27 12:50:21 +04:00
{
unsigned long ret ;
down_write ( & current - > mm - > mmap_sem ) ;
ret = do_mremap ( addr , old_len , new_len , flags , new_addr ) ;
up_write ( & current - > mm - > mmap_sem ) ;
return ret ;
2005-04-17 02:20:36 +04:00
}
2013-02-23 04:35:56 +04:00
struct page * follow_page_mask ( struct vm_area_struct * vma ,
unsigned long address , unsigned int flags ,
unsigned int * page_mask )
2005-04-17 02:20:36 +04:00
{
2013-02-23 04:35:56 +04:00
* page_mask = 0 ;
2005-04-17 02:20:36 +04:00
return NULL ;
}
2011-07-09 02:39:46 +04:00
int remap_pfn_range ( struct vm_area_struct * vma , unsigned long addr ,
unsigned long pfn , unsigned long size , pgprot_t prot )
2005-04-17 02:20:36 +04:00
{
2011-07-09 02:39:46 +04:00
if ( addr ! = ( pfn < < PAGE_SHIFT ) )
return - EINVAL ;
mm: kill vma flag VM_RESERVED and mm->reserved_vm counter
A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA.
It has since lost its original meaning but still has some effects:
| effect | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
This patch removes the reserved_vm counter from mm_struct. It seems nobody
cares about it: it is not exported to userspace directly, it only
reduces the total_vm shown in proc.
Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.
remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.
[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:29:02 +04:00
vma - > vm_flags | = VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP ;
2005-09-12 05:18:10 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2006-07-14 11:24:09 +04:00
EXPORT_SYMBOL ( remap_pfn_range ) ;
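Without an MMU there is no translation, so remap_pfn_range() above can only succeed when the requested address already is the physical address of the page frame. A small sketch of that identity check follows; `model_check_identity()` and `MODEL_PFN_PAGE_SHIFT` are hypothetical names, and the 4 KiB page size is an assumption.

```c
#include <errno.h>

/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define MODEL_PFN_PAGE_SHIFT 12

/*
 * Returns 0 when addr is exactly the physical address of page frame pfn,
 * -EINVAL otherwise. This mirrors the only check the NOMMU
 * remap_pfn_range() performs before tagging the mapping.
 */
static int model_check_identity(unsigned long addr, unsigned long pfn)
{
	if (addr != (pfn << MODEL_PFN_PAGE_SHIFT))
		return -EINVAL;
	return 0;
}
```

For example, page frame 0x20 corresponds to address 0x20000 under the assumed shift, so that pair passes and any other address for the same frame is rejected.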
2005-04-17 02:20:36 +04:00
2013-04-28 00:25:38 +04:00
int vm_iomap_memory ( struct vm_area_struct * vma , phys_addr_t start , unsigned long len )
{
unsigned long pfn = start > > PAGE_SHIFT ;
unsigned long vm_len = vma - > vm_end - vma - > vm_start ;
pfn + = vma - > vm_pgoff ;
return io_remap_pfn_range ( vma , vma - > vm_start , pfn , vm_len , vma - > vm_page_prot ) ;
}
EXPORT_SYMBOL ( vm_iomap_memory ) ;
2008-02-05 09:29:59 +03:00
int remap_vmalloc_range ( struct vm_area_struct * vma , void * addr ,
unsigned long pgoff )
{
unsigned int size = vma - > vm_end - vma - > vm_start ;
if ( ! ( vma - > vm_flags & VM_USERMAP ) )
return - EINVAL ;
vma - > vm_start = ( unsigned long ) ( addr + ( pgoff < < PAGE_SHIFT ) ) ;
vma - > vm_end = vma - > vm_start + size ;
return 0 ;
}
EXPORT_SYMBOL ( remap_vmalloc_range ) ;
2005-04-17 02:20:36 +04:00
unsigned long arch_get_unmapped_area ( struct file * file , unsigned long addr ,
unsigned long len , unsigned long pgoff , unsigned long flags )
{
return - ENOMEM ;
}
void unmap_mapping_range ( struct address_space * mapping ,
loff_t const holebegin , loff_t const holelen ,
int even_cows )
{
}
2006-07-14 11:24:09 +04:00
EXPORT_SYMBOL ( unmap_mapping_range ) ;
2005-04-17 02:20:36 +04:00
/*
* Check that a process has enough memory to allocate a new virtual
* mapping . 0 means there is enough memory for the allocation to
* succeed and - ENOMEM implies there is not .
*
* We currently support three overcommit policies , which are set via the
* vm . overcommit_memory sysctl . See Documentation / vm / overcommit - accounting
*
* Strict overcommit modes added 2002 Feb 26 by Alan Cox .
* Additional code 2002 Jul 20 by Robert Love .
*
* cap_sys_admin is 1 if the process has admin privileges , 0 otherwise .
*
* Note this is a helper function intended to be used by LSMs which
* wish to use this logic .
*/
2007-08-23 01:01:28 +04:00
int __vm_enough_memory ( struct mm_struct * mm , long pages , int cap_sys_admin )
2005-04-17 02:20:36 +04:00
{
2015-02-12 02:28:42 +03:00
long free , allowed , reserve ;
2005-04-17 02:20:36 +04:00
vm_acct_memory ( pages ) ;
/*
* Sometimes we want to use more memory than we have
*/
if ( sysctl_overcommit_memory = = OVERCOMMIT_ALWAYS )
return 0 ;
if ( sysctl_overcommit_memory = = OVERCOMMIT_GUESS ) {
2011-07-26 04:12:19 +04:00
free = global_page_state ( NR_FREE_PAGES ) ;
free + = global_page_state ( NR_FILE_PAGES ) ;
/*
* shmem pages shouldn ' t be counted as free in this
* case , they can ' t be purged , only swapped out , and
* that won ' t affect the overall amount of available
* memory in the system .
*/
free - = global_page_state ( NR_SHMEM ) ;
2005-04-17 02:20:36 +04:00
swap: add per-partition lock for swapfile
swap_lock is heavily contended when I test swapping to 3 fast SSDs (it is
even slightly slower than swapping to 2 such SSDs). The main contention comes
from swap_info_get(). This patch tries to close the gap by adding a new
per-partition lock.
Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.
nr_swap_pages is now an atomic, so it can be changed without swap_lock. In
theory, get_swap_page() may find no swap pages when free swap pages
actually exist, but this does not seem to be a big problem.
Accessing partition-specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.
Changing swap_info_struct.flags requires holding both swap_lock and
swap_info_struct.lock, because scan_swap_map() will check it. Reading the
flags is fine with either lock held.
If both swap_lock and swap_info_struct.lock must be held, we always take
the former first to avoid deadlock.
swap_entry_free() can change swap_list. To delete that code, we add a
new highest_priority_index. Whenever get_swap_page() is called, we
check it. If it's valid, we use it.
It's a pity get_swap_page() still holds swap_lock. But in practice,
swap_lock isn't heavily contended in my test with this patch (there are
other, much heavier bottlenecks such as TLB flush). Also, it looks like
get_swap_page() doesn't really need the lock: we never free
swap_info[] and we check the SWAP_WRITEOK flag. The only risk without the
lock is that we could swap out to some low-priority swap, but we can quickly
recover after several rounds of swap, so it doesn't sound like a big deal to me.
But I'd prefer to fix this if it's a real problem.
"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s. This patch further improves the
speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
so TLB flush isn't the biggest bottleneck before the patches.
[arnd@arndb.de: fix it for nommu]
[hughd@google.com: add missing unlock]
[minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 04:34:38 +04:00
free + = get_nr_swap_pages ( ) ;
2005-04-17 02:20:36 +04:00
/*
* Any slabs which are created with the
* SLAB_RECLAIM_ACCOUNT flag claim to have contents
* which are reclaimable , under pressure . The dentry
* cache and most inode caches should fall into this
*/
2006-09-26 10:31:51 +04:00
free + = global_page_state ( NR_SLAB_RECLAIMABLE ) ;
2005-04-17 02:20:36 +04:00
2006-04-11 09:53:01 +04:00
/*
* Leave reserved pages . The pages are not for anonymous pages .
*/
2011-07-26 04:12:19 +04:00
if ( free < = totalreserve_pages )
2006-04-11 09:53:01 +04:00
goto error ;
else
2011-07-26 04:12:19 +04:00
free - = totalreserve_pages ;
2006-04-11 09:53:01 +04:00
/*
2013-04-30 02:08:11 +04:00
* Reserve some for root
2006-04-11 09:53:01 +04:00
*/
2005-04-17 02:20:36 +04:00
if ( ! cap_sys_admin )
2013-04-30 02:08:11 +04:00
free - = sysctl_admin_reserve_kbytes > > ( PAGE_SHIFT - 10 ) ;
2005-04-17 02:20:36 +04:00
if ( free > pages )
return 0 ;
2006-04-11 09:53:01 +04:00
goto error ;
2005-04-17 02:20:36 +04:00
}
2013-11-13 03:08:31 +04:00
allowed = vm_commit_limit ( ) ;
2005-04-17 02:20:36 +04:00
/*
2013-04-30 02:08:11 +04:00
* Reserve some for root
2005-04-17 02:20:36 +04:00
*/
if ( ! cap_sys_admin )
2013-04-30 02:08:11 +04:00
allowed - = sysctl_admin_reserve_kbytes > > ( PAGE_SHIFT - 10 ) ;
2005-04-17 02:20:36 +04:00
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because this mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutely safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcommit 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
----------   ----   ----   -------------   -----   -------------   --------------
guess        yes    1      5432/5432       no      yes             yes
guess        yes    4      5444/5444       1       yes             yes
guess        no     1      5302/5449       no      yes             yes
guess        no     4      -               crash   no              no
never        yes    1      5460/5460       1       yes             yes
never        yes    4      5460/5460       1       yes             yes
never        no     1      5218/5432       no      no              yes
never        no     4      5203/5448       no      no              yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
----------   ----   ----   -------------   -----   -------------   --------------
guess        yes    1      5419/5419       no      -     yes       8MB   yes
guess        yes    4      5436/5436       1       -     yes       8MB   yes
guess        no     1      5440/5440       *       -     yes       8MB   yes
guess        no     4      -               crash   -     no        8MB   no
* process would successfully mlock, then the oom killer would pick it
never        yes    1      5446/5446       no      10MB  yes       20MB  yes
never        yes    4      5456/5456       no      10MB  yes       20MB  yes
never        no     1      5387/5429       no      128MB no        8MB   barely
never        no     1      5323/5428       no      226MB barely    8MB   barely
never        no     1      5323/5428       no      226MB barely    8MB   barely
never        no     1      5359/5448       no      10MB  no        10MB  barely
never        no     1      5323/5428       no      0MB   no        10MB  barely
never        no     1      5332/5428       no      0MB   no        50MB  yes
never        no     1      5293/5429       no      0MB   no        90MB  yes
never        no     1      5001/5427       no      230MB yes       338MB yes
never        no     4*     4998/5424       no      230MB yes       338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 02:08:10 +04:00
/*
* Don ' t let a single process grow so big a user can ' t recover
*/
if ( mm ) {
reserve = sysctl_user_reserve_kbytes > > ( PAGE_SHIFT - 10 ) ;
2015-02-12 02:28:42 +03:00
allowed - = min_t ( long , mm - > total_vm / 32 , reserve ) ;
2013-04-30 02:08:10 +04:00
}
2005-04-17 02:20:36 +04:00
2009-05-01 02:08:51 +04:00
	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
2005-04-17 02:20:36 +04:00
		return 0;
2009-05-01 02:08:51 +04:00
2006-04-11 09:53:01 +04:00
error:
2005-04-17 02:20:36 +04:00
	vm_unacct_memory(pages);
	return -ENOMEM;
}
2007-07-19 12:47:03 +04:00
int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
2006-01-06 11:11:42 +03:00
{
	BUG();
2007-07-19 12:47:03 +04:00
	return 0;
2006-01-06 11:11:42 +03:00
}
2007-07-21 15:37:25 +04:00
EXPORT_SYMBOL(filemap_fault);
2006-09-27 12:50:15 +04:00
2014-04-08 02:37:19 +04:00
void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	BUG();
}
EXPORT_SYMBOL(filemap_map_pages);
2011-03-29 17:05:12 +04:00
static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long addr, void *buf, int len, int write)
2006-09-27 12:50:15 +04:00
{
	struct vm_area_struct *vma;

	down_read(&mm->mmap_sem);

	/* the access must start within one of the target process's mappings */
2006-09-27 12:50:16 +04:00
	vma = find_vma(mm, addr);
	if (vma) {
2006-09-27 12:50:15 +04:00
		/* don't overrun this mapping */
		if (addr + len >= vma->vm_end)
			len = vma->vm_end - addr;

		/* only read or write mappings where it is permitted */
2006-09-27 12:50:19 +04:00
		if (write && vma->vm_flags & VM_MAYWRITE)
2010-01-06 20:23:28 +03:00
			copy_to_user_page(vma, NULL, addr,
					 (void *) addr, buf, len);
2006-09-27 12:50:19 +04:00
		else if (!write && vma->vm_flags & VM_MAYREAD)
2010-01-06 20:23:28 +03:00
			copy_from_user_page(vma, NULL, addr,
					    buf, (void *) addr, len);
2006-09-27 12:50:15 +04:00
		else
			len = 0;
	} else {
		len = 0;
	}

	up_read(&mm->mmap_sem);
2011-03-29 17:05:12 +04:00
	return len;
}
/**
 * access_remote_vm - access another process' address space
 * @mm: the mm_struct of the target address space
 * @addr: start address to access
 * @buf: source or destination buffer
 * @len: number of bytes to transfer
 * @write: whether the access is a write
 *
 * The caller must hold a reference on @mm.
 */
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
		void *buf, int len, int write)
{
	return __access_remote_vm(NULL, mm, addr, buf, len, write);
}
/*
 * Access another process' address space.
 * - source/target buffer must be kernel space
 */
int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
{
	struct mm_struct *mm;

	if (addr + len < addr)
		return 0;

	mm = get_task_mm(tsk);
	if (!mm)
		return 0;

	len = __access_remote_vm(tsk, mm, addr, buf, len, write);
2006-09-27 12:50:15 +04:00
	mmput(mm);
	return len;
}
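The length clamp in __access_remote_vm() can be tried out in user space. The following is only a sketch under stated assumptions: `struct demo_vma` and `demo_clamp_len()` are hypothetical names, not kernel API, and the explicit containment test stands in for what find_vma() is expected to guarantee on NOMMU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two vm_area_struct fields used here. */
struct demo_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* first byte after the mapping */
};

/*
 * Return how many bytes of [addr, addr + len) may be transferred without
 * running past the end of the mapping, or 0 if the access is not allowed
 * to start (no mapping, address outside it, or address wraparound).
 */
static inline int demo_clamp_len(const struct demo_vma *vma,
				 unsigned long addr, int len)
{
	assert(len >= 0);

	/* same wraparound guard as access_process_vm() */
	if (addr + (unsigned long)len < addr)
		return 0;

	/* assumption: the lookup only succeeds if addr lies in the mapping */
	if (!vma || addr < vma->vm_start || addr >= vma->vm_end)
		return 0;

	/* don't overrun this mapping */
	if (addr + (unsigned long)len >= vma->vm_end)
		len = (int)(vma->vm_end - addr);

	return len;
}
```

A caller therefore has to treat a short return value as a partial transfer rather than an error.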
nommu: fix shared mmap after truncate shrinkage problems
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
over the end of a truncation. The problem is that
ramfs_nommu_check_mappings() checks the reduced file size against the
VMA tree, but not the vm_region tree.
The following sequence of events can cause the problem:
fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
ftruncate(fd, 32 * 1024);
a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
munmap(a, 32 * 1024);
ftruncate(fd, 16 * 1024);
c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b'
sees that the vm_region from 'a' is covering the region it wants and so
shares it, pinning it in memory.
Mapping 'a' then goes away and the file is truncated to the end of VMA
'b'. However, the region allocated by 'a' is still in effect, and has
_not_ been reduced.
Mapping 'c' is then created, and because there's a vm_region covering the
desired region, get_unmapped_area() is _not_ called to repeat the check,
and the mapping is granted, even though the pages from the latter half of
the mapping have been discarded.
However:
d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
Mapping 'd' should work, and should end up sharing the region allocated by
'a'.
To deal with this, we shrink the vm_region struct during the truncation,
lest do_mmap_pgoff() take it as licence to share the full region
automatically without calling the get_unmapped_area() file op again.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 04:01:39 +03:00
/**
 * nommu_shrink_inode_mappings - Shrink the shared mappings on an inode
 * @inode: The inode to check
 * @size: The current filesize of the inode
 * @newsize: The proposed filesize of the inode
 *
 * Check the shared mappings on an inode on behalf of a shrinking truncate to
 * make sure that any outstanding VMAs aren't broken and then shrink the
 * vm_regions that extend beyond the new size so that do_mmap_pgoff() doesn't
 * automatically grant mappings that are too large.
 */
int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
				size_t newsize)
{
	struct vm_area_struct *vma;
	struct vm_region *region;
	pgoff_t low, high;
	size_t r_size, r_top;

	low = newsize >> PAGE_SHIFT;
	high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;

	down_write(&nommu_region_sem);
2014-12-13 03:54:39 +03:00
	i_mmap_lock_read(inode->i_mapping);
2010-01-16 04:01:39 +03:00
	/* search for VMAs that fall within the dead zone */
2012-10-09 03:31:25 +04:00
	vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
2010-01-16 04:01:39 +03:00
		/* found one - only interested if it's shared out of the page
		 * cache */
		if (vma->vm_flags & VM_SHARED) {
2014-12-13 03:54:39 +03:00
			i_mmap_unlock_read(inode->i_mapping);
2010-01-16 04:01:39 +03:00
			up_write(&nommu_region_sem);
			return -ETXTBSY; /* not quite true, but near enough */
		}
	}

	/* reduce any regions that overlap the dead zone - if in existence,
	 * these will be pointed to by VMAs that don't overlap the dead zone
	 *
	 * we don't check for any regions that start beyond the EOF as there
	 * shouldn't be any
	 */
2014-12-13 03:54:39 +03:00
	vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 0, ULONG_MAX) {
2010-01-16 04:01:39 +03:00
		if (!(vma->vm_flags & VM_SHARED))
			continue;

		region = vma->vm_region;
		r_size = region->vm_top - region->vm_start;
		r_top = (region->vm_pgoff << PAGE_SHIFT) + r_size;

		if (r_top > newsize) {
			region->vm_top -= r_top - newsize;
			if (region->vm_end > region->vm_top)
				region->vm_end = region->vm_top;
		}
	}
2014-12-13 03:54:39 +03:00
	i_mmap_unlock_read(inode->i_mapping);
2010-01-16 04:01:39 +03:00
	up_write(&nommu_region_sem);
	return 0;
}
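The region arithmetic in the second loop of nommu_shrink_inode_mappings() can be checked in isolation. This is a user-space sketch, not kernel code: `struct demo_region`, `demo_shrink_region()` and `DEMO_PAGE_SHIFT` (assumed to be 12, i.e. 4KB pages) are hypothetical names that mirror only the vm_region fields the function touches.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4KB pages */

/* Hypothetical mirror of the vm_region fields used by the shrink step. */
struct demo_region {
	unsigned long vm_start;	/* start address of the region */
	unsigned long vm_end;	/* end of the part handed out to mappings */
	unsigned long vm_top;	/* end of the memory backing the region */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Trim a shared region so that it no longer extends past newsize bytes. */
static inline void demo_shrink_region(struct demo_region *region,
				      size_t newsize)
{
	size_t r_size, r_top;

	assert(region->vm_top >= region->vm_start);

	r_size = region->vm_top - region->vm_start;
	/* file offset, in bytes, of the first byte after the region */
	r_top = ((size_t)region->vm_pgoff << DEMO_PAGE_SHIFT) + r_size;

	if (r_top > newsize) {
		region->vm_top -= r_top - newsize;
		if (region->vm_end > region->vm_top)
			region->vm_end = region->vm_top;
	}
}
```

With the 32KB region left behind by mapping 'a' in the commit message and a truncate to 16KB, vm_top and vm_end both move down by 16KB, so a later 32KB request no longer finds a region large enough to share.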
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% of the current process size, user_reserve_kbytes). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_kbytes defaults to min(3% of free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
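The cap described above can be sketched in user space. `demo_user_reserve_pages()` is a hypothetical helper, and dividing by 32 as the "3%" approximation is an assumption borrowed from the init helpers in this file; the real accounting lives in __vm_enough_memory().

```c
#include <assert.h>

/*
 * Pages withheld from the current process for other user processes:
 * min(about 3% of the process size, the tunable reserve).
 * page_shift is log2 of the page size (12 for 4KB pages).
 */
static inline unsigned long demo_user_reserve_pages(unsigned long total_vm_pages,
						    unsigned long user_reserve_kbytes,
						    unsigned int page_shift)
{
	unsigned long reserve_pages, three_percent;

	assert(page_shift >= 10);

	/* the tunable is in kilobytes, the accounting is in pages */
	reserve_pages = user_reserve_kbytes >> (page_shift - 10);
	/* 1/32 is the cheap approximation of 3% */
	three_percent = total_vm_pages / 32;

	return three_percent < reserve_pages ? three_percent : reserve_pages;
}
```

For a 32GB process with 4KB pages, the uncapped 3% is about 1GB, whereas a 128MB tunable limits the reserve to 32768 pages.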
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root cannot log in and perform recovery operations.
Note that this reserve shrinks as free memory is consumed, so it doesn't
guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) processes, one per core. It
is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable admin_reserve_kbytes, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root can't log in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_kbytes is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_kbytes defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_kbytes.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because this mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutely safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_kbytes is set to 0?
Note, this only affects overcommit 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
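The two rules of thumb from the FAQ entry on minimum useful reserves (sum of RSS for 'guess', max VSZ plus sum of RSS for 'never') can be written down as a small calculation. This is a sketch only: `struct demo_proc` and both helpers are hypothetical, and the per-program sizes have to come from a real system, for example from ps output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one recovery program (sshd, bash, top, ...). */
struct demo_proc {
	const char *name;
	unsigned long vsz_kb;	/* virtual size in kilobytes */
	unsigned long rss_kb;	/* resident set size in kilobytes */
};

/* overcommit 'guess': sum of the resident set sizes */
static inline unsigned long demo_reserve_guess_kb(const struct demo_proc *procs,
						  size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(procs != NULL || n == 0);

	for (i = 0; i < n; i++)
		sum += procs[i].rss_kb;
	return sum;
}

/* overcommit 'never': largest virtual size plus the sum of the RSS */
static inline unsigned long demo_reserve_never_kb(const struct demo_proc *procs,
						  size_t n)
{
	unsigned long max_vsz = 0;
	size_t i;

	assert(procs != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (procs[i].vsz_kb > max_vsz)
			max_vsz = procs[i].vsz_kb;
	return max_vsz + demo_reserve_guess_kb(procs, n);
}
```

The 'never' figure is dominated by the largest virtual size, which is why the commit message quotes roughly 128MB for it against roughly 8MB for 'guess' on x86_64.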
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5432/5432     | no    | yes           | yes
guess      | yes  | 4    | 5444/5444     | 1     | yes           | yes
guess      | no   | 1    | 5302/5449     | no    | yes           | yes
guess      | no   | 4    | -             | crash | no            | no
never      | yes  | 1    | 5460/5460     | 1     | yes           | yes
never      | yes  | 4    | 5460/5460     | 1     | yes           | yes
never      | no   | 1    | 5218/5432     | no    | no            | yes
never      | no   | 4    | 5203/5448     | no    | no            | yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5419/5419     | no    | - yes         | 8MB yes
guess      | yes  | 4    | 5436/5436     | 1     | - yes         | 8MB yes
guess      | no   | 1    | 5440/5440     | *     | - yes         | 8MB yes
guess      | no   | 4    | -             | crash | - no          | 8MB no
* process would successfully mlock, then the oom killer would pick it
never      | yes  | 1    | 5446/5446     | no    | 10MB yes      | 20MB yes
never      | yes  | 4    | 5456/5456     | no    | 10MB yes      | 20MB yes
never      | no   | 1    | 5387/5429     | no    | 128MB no      | 8MB barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB barely
never      | no   | 1    | 5359/5448     | no    | 10MB no       | 10MB barely
never      | no   | 1    | 5323/5428     | no    | 0MB no        | 10MB barely
never      | no   | 1    | 5332/5428     | no    | 0MB no        | 50MB yes
never      | no   | 1    | 5293/5429     | no    | 0MB no        | 90MB yes
never      | no   | 1    | 5001/5427     | no    | 230MB yes     | 338MB yes
never      | no   | 4*   | 4998/5424     | no    | 230MB yes     | 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 02:08:10 +04:00
/*
 * Initialise sysctl_user_reserve_kbytes.
 *
 * This is intended to prevent a user from starting a single memory hogging
 * process, such that they cannot recover (kill the hog) in OVERCOMMIT_NEVER
 * mode.
 *
 * The default value is min(3% of free memory, 128MB)
 * 128MB is enough to recover with sshd/login, bash, and top/kill.
 */
static int __meminit init_user_reserve(void)
{
	unsigned long free_kbytes;

	free_kbytes = global_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10);

	sysctl_user_reserve_kbytes = min(free_kbytes / 32, 1UL << 17);
	return 0;
}
2015-05-02 03:08:20 +03:00
subsys_initcall(init_user_reserve);
2013-04-30 02:08:11 +04:00
/*
 * Initialise sysctl_admin_reserve_kbytes.
 *
 * The purpose of sysctl_admin_reserve_kbytes is to allow the sys admin
 * to log in and kill a memory hogging process.
 *
 * Systems with more than 256MB will reserve 8MB, enough to recover
 * with sshd, bash, and top in OVERCOMMIT_GUESS. Smaller systems will
 * only reserve 3% of free pages by default.
 */
static int __meminit init_admin_reserve(void)
{
	unsigned long free_kbytes;

	free_kbytes = global_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10);

	sysctl_admin_reserve_kbytes = min(free_kbytes / 32, 1UL << 13);
	return 0;
}
2015-05-02 03:08:20 +03:00
subsys_initcall(init_admin_reserve);
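Both init helpers apply the same formula with a different cap (1UL << 17 kB = 128MB for the user reserve, 1UL << 13 kB = 8MB for the admin reserve). A user-space sketch of that formula follows; `demo_default_reserve_kbytes()` is a hypothetical name, and the free page count is passed in rather than read from global_page_state().

```c
#include <assert.h>

/*
 * Default reserve in kilobytes: min(free memory / 32, cap).
 * free_pages is the number of free pages, page_shift is log2 of the
 * page size (12 for 4KB pages) and cap_kbytes is the upper bound.
 */
static inline unsigned long demo_default_reserve_kbytes(unsigned long free_pages,
							unsigned long cap_kbytes,
							unsigned int page_shift)
{
	unsigned long free_kbytes, share;

	assert(page_shift >= 10);

	/* pages to kilobytes, as in the init helpers */
	free_kbytes = free_pages << (page_shift - 10);
	/* 1/32 of free memory, the "3%" in the comments */
	share = free_kbytes / 32;

	return share < cap_kbytes ? share : cap_kbytes;
}
```

With 1GB free and 4KB pages the share is 32MB, so the user reserve defaults to 32MB while the admin reserve is capped at 8MB; the admin cap takes effect once more than 256MB is free, which matches the comment above init_admin_reserve().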