2019-05-19 15:08:55 +03:00
// SPDX-License-Identifier: GPL-2.0-only
2005-04-17 02:20:36 +04:00
/*
* mm / mmap . c
*
* Written by obz .
*
2009-01-05 17:06:29 +03:00
* Address space accounting code < alan @ lxorguk . ukuu . org . uk >
2005-04-17 02:20:36 +04:00
*/
2014-06-07 01:38:30 +04:00
# define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
2013-04-30 02:08:33 +04:00
# include <linux/kernel.h>
2005-04-17 02:20:36 +04:00
# include <linux/slab.h>
2007-10-17 10:29:23 +04:00
# include <linux/backing-dev.h>
2005-04-17 02:20:36 +04:00
# include <linux/mm.h>
2022-01-15 01:06:07 +03:00
# include <linux/mm_inline.h>
2005-04-17 02:20:36 +04:00
# include <linux/shm.h>
# include <linux/mman.h>
# include <linux/pagemap.h>
# include <linux/swap.h>
# include <linux/syscalls.h>
2006-01-11 23:17:46 +03:00
# include <linux/capability.h>
2005-04-17 02:20:36 +04:00
# include <linux/init.h>
# include <linux/file.h>
# include <linux/fs.h>
# include <linux/personality.h>
# include <linux/security.h>
# include <linux/hugetlb.h>
2016-07-27 01:26:15 +03:00
# include <linux/shmem_fs.h>
2005-04-17 02:20:36 +04:00
# include <linux/profile.h>
2011-10-16 10:01:52 +04:00
# include <linux/export.h>
2005-04-17 02:20:36 +04:00
# include <linux/mount.h>
# include <linux/mempolicy.h>
# include <linux/rmap.h>
mmu-notifiers: core
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incraese in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows to compile a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled. Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 02:46:29 +04:00
# include <linux/mmu_notifier.h>
2014-08-07 03:06:36 +04:00
# include <linux/mmdebug.h>
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 14:02:48 +04:00
# include <linux/perf_event.h>
2010-10-30 10:54:44 +04:00
# include <linux/audit.h>
2011-01-14 02:46:59 +03:00
# include <linux/khugepaged.h>
uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints
Add uprobes support to the core kernel, with x86 support.
This commit adds the kernel facilities, the actual uprobes
user-space ABI and perf probe support comes in later commits.
General design:
Uprobes are maintained in an rb-tree indexed by inode and offset
(the offset here is from the start of the mapping). For a unique
(inode, offset) tuple, there can be at most one uprobe in the
rb-tree.
Since the (inode, offset) tuple identifies a unique uprobe, more
than one user may be interested in the same uprobe. This provides
the ability to connect multiple 'consumers' to the same uprobe.
Each consumer defines a handler and a filter (optional). The
'handler' is run every time the uprobe is hit, if it matches the
'filter' criteria.
The first consumer of a uprobe causes the breakpoint to be
inserted at the specified address and subsequent consumers are
appended to this list. On subsequent probes, the consumer gets
appended to the existing list of consumers. The breakpoint is
removed when the last consumer unregisters. For all other
unregisterations, the consumer is removed from the list of
consumers.
Given a inode, we get a list of the mms that have mapped the
inode. Do the actual registration if mm maps the page where a
probe needs to be inserted/removed.
We use a temporary list to walk through the vmas that map the
inode.
- The number of maps that map the inode, is not known before we
walk the rmap and keeps changing.
- extending vm_area_struct wasn't recommended, it's a
size-critical data structure.
- There can be more than one maps of the inode in the same mm.
We add callbacks to the mmap methods to keep an eye on text vmas
that are of interest to uprobes. When a vma of interest is mapped,
we insert the breakpoint at the right address.
Uprobe works by replacing the instruction at the address defined
by (inode, offset) with the arch specific breakpoint
instruction. We save a copy of the original instruction at the
uprobed address.
This is needed for:
a. executing the instruction out-of-line (xol).
b. instruction analysis for any subsequent fixups.
c. restoring the instruction back when the uprobe is unregistered.
We insert or delete a breakpoint instruction, and this
breakpoint instruction is assumed to be the smallest instruction
available on the platform. For fixed size instruction platforms
this is trivially true, for variable size instruction platforms
the breakpoint instruction is typically the smallest (often a
single byte).
Writing the instruction is done by COWing the page and changing
the instruction during the copy, this even though most platforms
allow atomic writes of the breakpoint instruction. This also
mirrors the behaviour of a ptrace() memory write to a PRIVATE
file map.
The core worker is derived from KSM's replace_page() logic.
In essence, similar to KSM:
a. allocate a new page and copy over contents of the page that
has the uprobed vaddr
b. modify the copy and insert the breakpoint at the required
address
c. switch the original page with the copy containing the
breakpoint
d. flush page tables.
replace_page() is being replicated here because of some minor
changes in the type of pages and also because Hugh Dickins had
plans to improve replace_page() for KSM specific work.
Instruction analysis on x86 is based on instruction decoder and
determines if an instruction can be probed and determines the
necessary fixups after singlestep. Instruction analysis is done
at probe insertion time so that we avoid having to repeat the
same analysis every time a probe is hit.
A lot of code here is due to the improvement/suggestions/inputs
from Peter Zijlstra.
Changelog:
(v10):
- Add code to clear REX.B prefix as suggested by Denys Vlasenko
and Masami Hiramatsu.
(v9):
- Use insn_offset_modrm as suggested by Masami Hiramatsu.
(v7):
Handle comments from Peter Zijlstra:
- Dont take reference to inode. (expect inode to uprobe_register to be sane).
- Use PTR_ERR to set the return value.
- No need to take reference to inode.
- use PTR_ERR to return error value.
- register and uprobe_unregister share code.
(v5):
- Modified del_consumer as per comments from Peter.
- Drop reference to inode before dropping reference to uprobe.
- Use i_size_read(inode) instead of inode->i_size.
- Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
- Includes errno.h as recommended by Stephen Rothwell to fix a build issue
on sparc defconfig
- Remove restrictions while unregistering.
- Earlier code leaked inode references under some conditions while
registering/unregistering.
- Continue the vma-rmap walk even if the intermediate vma doesnt
meet the requirements.
- Validate the vma found by find_vma before inserting/removing the
breakpoint
- Call del_consumer under mutex_lock.
- Use hash locks.
- Handle mremap.
- Introduce find_least_offset_node() instead of close match logic in
find_uprobe
- Uprobes no more depends on MM_OWNER; No reference to task_structs
while inserting/removing a probe.
- Uses read_mapping_page instead of grab_cache_page so that the pages
have valid content.
- pass NULL to get_user_pages for the task parameter.
- call SetPageUptodate on the new page allocated in write_opcode.
- fix leaking a reference to the new page under certain conditions.
- Include Instruction Decoder if Uprobes gets defined.
- Remove const attributes for instruction prefix arrays.
- Uses mm_context to know if the application is 32 bit.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Also-written-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
[ Made various small edits to the commit log ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-09 13:26:42 +04:00
# include <linux/uprobes.h>
2013-04-30 02:08:12 +04:00
# include <linux/notifier.h>
# include <linux/memory.h>
2014-06-07 01:38:30 +04:00
# include <linux/printk.h>
2015-09-05 01:46:24 +03:00
# include <linux/userfaultfd_k.h>
2016-02-03 03:57:43 +03:00
# include <linux/moduleparam.h>
mm/core, x86/mm/pkeys: Add execute-only protection keys support
Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches. That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.
This patch uses protection keys to set up mappings to do just that.
If a user calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory. It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.
I haven't found any userspace that does this today. With this
facility in place, we expect userspace to move to use it
eventually. Userspace _could_ start doing this today. Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code. IOW, userspace never has to do any PROT_EXEC runtime
detection.
This feature provides enhanced protection against leaking
executable memory contents. This helps thwart attacks which are
attempting to find ROP gadgets on the fly.
But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace. An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.
The protection key that is used for execute-only support is
permanently dedicated at compile time. This is fine for now
because there is currently no API to set a protection key other
than this one.
Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process. That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.
PKRU *is* a user register and the kernel is modifying it. That
means that code doing:
pkru = rdpkru()
pkru |= 0x100;
mmap(..., PROT_EXEC);
wrpkru(pkru);
could lose the bits in PKRU that enforce execute-only
permissions. To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-13 00:02:40 +03:00
# include <linux/pkeys.h>
2017-09-07 02:25:00 +03:00
# include <linux/oom.h>
2019-04-19 03:50:52 +03:00
# include <linux/sched/mm.h>
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example for this are programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these type of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higler level entity
like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal workload knowledge to make targeted madvise
calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have the knowledge about
what else is running on a machine. Therefore it has to be very
conservative in what memory areas can be shared or not. However, if the
system is dedicated to running multiple jobs within the same security
domain, its the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine
with 64GB main memory):
full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests.
This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test
the new prctl options and the other one is a fork test to verify that
the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (ksm) gets enabled for all vma's of a
process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been
enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call
- enable ksm for all vmas of a process (if the vmas support it).
- query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the
MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 08:13:40 +03:00
# include <linux/ksm.h>
2005-04-17 02:20:36 +04:00
2016-12-24 22:46:01 +03:00
# include <linux/uaccess.h>
2005-04-17 02:20:36 +04:00
# include <asm/cacheflush.h>
# include <asm/tlb.h>
2007-05-02 21:27:14 +04:00
# include <asm/mmu_context.h>
2005-04-17 02:20:36 +04:00
2020-04-02 07:09:13 +03:00
# define CREATE_TRACE_POINTS
# include <trace/events/mmap.h>
2008-07-24 08:27:10 +04:00
# include "internal.h"
2006-09-07 14:17:04 +04:00
# ifndef arch_mmap_check
# define arch_mmap_check(addr, len, flags) (0)
# endif
mm: mmap: add new /proc tunable for mmap_base ASLR
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack. This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.
The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems. The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy. This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.
The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:
http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device. The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece). With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.
The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally a 3-4 bits less than the number of bits in the
user-space accessible virtual address space.
When decided whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.
This patch (of 4):
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Signed-off-by: Daniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 02:19:53 +03:00
# ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
const int mmap_rnd_bits_min = CONFIG_ARCH_MMAP_RND_BITS_MIN ;
const int mmap_rnd_bits_max = CONFIG_ARCH_MMAP_RND_BITS_MAX ;
int mmap_rnd_bits __read_mostly = CONFIG_ARCH_MMAP_RND_BITS ;
# endif
# ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS
const int mmap_rnd_compat_bits_min = CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN ;
const int mmap_rnd_compat_bits_max = CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX ;
int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS ;
# endif
2016-05-21 02:57:45 +03:00
static bool ignore_rlimit_data ;
2016-02-03 03:57:43 +03:00
core_param ( ignore_rlimit_data , ignore_rlimit_data , bool , 0644 ) ;
mm: mmap: add new /proc tunable for mmap_base ASLR
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack. This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.
The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems. The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy. This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.
The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:
http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device. The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece). With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.
The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally a 3-4 bits less than the number of bits in the
user-space accessible virtual address space.
When decided whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.
This patch (of 4):
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Signed-off-by: Daniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 02:19:53 +03:00
2022-09-06 22:49:06 +03:00
static void unmap_region ( struct mm_struct * mm , struct maple_tree * mt ,
[PATCH] freepgt: free_pgtables use vma list
Recent woes with some arches needing their own pgd_addr_end macro; and 4-level
clear_page_range regression since 2.6.10's clear_page_tables; and its
long-standing well-known inefficiency in searching throughout the higher-level
page tables for those few entries to clear and free: all can be blamed on
ignoring the list of vmas when we free page tables.
Replace exit_mmap's clear_page_range of the total user address space by
free_pgtables operating on the mm's vma list; unmap_region use it in the same
way, giving floor and ceiling beyond which it may not free tables. This
brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled,
in which case latency fixes spoil unmap_vmas throughput).
Beware: the do_mmap_pgoff driver failure case must now use unmap_region
instead of zap_page_range, since a page table might have been allocated, and
can only be freed while it is touched by some vma.
Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
from the clear_page_range levels. (Most of free_pgtables' old code was
actually for a non-existent case, prev not properly set up, dating from before
hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we
might want to add latency lockdrops later; but no attempt to do so yet, going
by vma should itself reduce latency.
But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful
examination: put that off until a later patch of the series.
What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma?
And the range to sparc64's flush_tlb_pgtables? It's less clear to me now that
we need to do more than is done here - every PMD_SIZE ever occupied will be
flushed, do we really have to flush every PGDIR_SIZE ever partially occupied?
A shame to complicate it unnecessarily.
Special thanks to David Miller for time spent repairing my ceilings.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-20 00:29:15 +04:00
struct vm_area_struct * vma , struct vm_area_struct * prev ,
2022-09-06 22:49:06 +03:00
struct vm_area_struct * next , unsigned long start ,
2023-01-26 22:37:51 +03:00
unsigned long end , bool mm_wr_locked ) ;
[PATCH] freepgt: free_pgtables use vma list
Recent woes with some arches needing their own pgd_addr_end macro; and 4-level
clear_page_range regression since 2.6.10's clear_page_tables; and its
long-standing well-known inefficiency in searching throughout the higher-level
page tables for those few entries to clear and free: all can be blamed on
ignoring the list of vmas when we free page tables.
Replace exit_mmap's clear_page_range of the total user address space by
free_pgtables operating on the mm's vma list; unmap_region use it in the same
way, giving floor and ceiling beyond which it may not free tables. This
brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled,
in which case latency fixes spoil unmap_vmas throughput).
Beware: the do_mmap_pgoff driver failure case must now use unmap_region
instead of zap_page_range, since a page table might have been allocated, and
can only be freed while it is touched by some vma.
Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
from the clear_page_range levels. (Most of free_pgtables' old code was
actually for a non-existent case, prev not properly set up, dating from before
hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we
might want to add latency lockdrops later; but no attempt to do so yet, going
by vma should itself reduce latency.
But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful
examination: put that off until a later patch of the series.
What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma?
And the range to sparc64's flush_tlb_pgtables? It's less clear to me now that
we need to do more than is done here - every PMD_SIZE ever occupied will be
flushed, do we really have to flush every PGDIR_SIZE ever partially occupied?
A shame to complicate it unnecessarily.
Special thanks to David Miller for time spent repairing my ceilings.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-20 00:29:15 +04:00
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:55:46 +04:00
static pgprot_t vm_pgprot_modify ( pgprot_t oldprot , unsigned long vm_flags )
{
return pgprot_modify ( oldprot , vm_get_page_prot ( vm_flags ) ) ;
}
/* Update vma->vm_page_prot to reflect vma->vm_flags. */
void vma_set_page_prot ( struct vm_area_struct * vma )
{
unsigned long vm_flags = vma - > vm_flags ;
2016-10-08 03:01:22 +03:00
pgprot_t vm_page_prot ;
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:55:46 +04:00
2016-10-08 03:01:22 +03:00
vm_page_prot = vm_pgprot_modify ( vma - > vm_page_prot , vm_flags ) ;
if ( vma_wants_writenotify ( vma , vm_page_prot ) ) {
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:55:46 +04:00
vm_flags & = ~ VM_SHARED ;
2016-10-08 03:01:22 +03:00
vm_page_prot = vm_pgprot_modify ( vm_page_prot , vm_flags ) ;
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:55:46 +04:00
}
2020-06-09 07:33:54 +03:00
/* remove_protection_ptes reads vma->vm_page_prot without mmap_lock */
2016-10-08 03:01:22 +03:00
WRITE_ONCE ( vma - > vm_page_prot , vm_page_prot ) ;
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:55:46 +04:00
}
2005-04-17 02:20:36 +04:00
/*
2014-12-13 03:54:24 +03:00
* Requires inode - > i_mapping - > i_mmap_rwsem
2005-04-17 02:20:36 +04:00
*/
static void __remove_shared_vm_struct ( struct vm_area_struct * vma ,
struct file * file , struct address_space * mapping )
{
if ( vma - > vm_flags & VM_SHARED )
2014-08-09 01:25:25 +04:00
mapping_unmap_writable ( mapping ) ;
2005-04-17 02:20:36 +04:00
flush_dcache_mmap_lock ( mapping ) ;
2015-02-11 01:09:59 +03:00
vma_interval_tree_remove ( vma , & mapping - > i_mmap ) ;
2005-04-17 02:20:36 +04:00
flush_dcache_mmap_unlock ( mapping ) ;
}
/*
2012-10-09 03:31:25 +04:00
* Unlink a file - based vm structure from its interval tree , to hide
2005-10-30 04:15:57 +03:00
* vma from rmap and vmtruncate before freeing its page tables .
2005-04-17 02:20:36 +04:00
*/
2005-10-30 04:15:57 +03:00
void unlink_file_vma ( struct vm_area_struct * vma )
2005-04-17 02:20:36 +04:00
{
struct file * file = vma - > vm_file ;
if ( file ) {
struct address_space * mapping = file - > f_mapping ;
2014-12-13 03:54:21 +03:00
i_mmap_lock_write ( mapping ) ;
2005-04-17 02:20:36 +04:00
__remove_shared_vm_struct ( vma , file , mapping ) ;
2014-12-13 03:54:21 +03:00
i_mmap_unlock_write ( mapping ) ;
2005-04-17 02:20:36 +04:00
}
2005-10-30 04:15:57 +03:00
}
/*
2022-09-06 22:49:06 +03:00
* Close a vm structure and free it .
2005-10-30 04:15:57 +03:00
*/
2023-02-27 20:36:31 +03:00
static void remove_vma ( struct vm_area_struct * vma , bool unreachable )
2005-10-30 04:15:57 +03:00
{
might_sleep ( ) ;
2005-04-17 02:20:36 +04:00
if ( vma - > vm_ops & & vma - > vm_ops - > close )
vma - > vm_ops - > close ( vma ) ;
2012-10-09 03:28:54 +04:00
if ( vma - > vm_file )
2005-10-30 04:15:57 +03:00
fput ( vma - > vm_file ) ;
2008-04-28 13:13:08 +04:00
mpol_put ( vma_policy ( vma ) ) ;
2023-02-27 20:36:31 +03:00
if ( unreachable )
__vm_area_free ( vma ) ;
else
vm_area_free ( vma ) ;
2005-04-17 02:20:36 +04:00
}
2023-01-20 19:26:08 +03:00
static inline struct vm_area_struct * vma_prev_limit ( struct vma_iterator * vmi ,
unsigned long min )
{
return mas_prev ( & vmi - > mas , min ) ;
}
static inline int vma_iter_clear_gfp ( struct vma_iterator * vmi ,
unsigned long start , unsigned long end , gfp_t gfp )
{
vmi - > mas . index = start ;
vmi - > mas . last = end - 1 ;
mas_store_gfp ( & vmi - > mas , NULL , gfp ) ;
if ( unlikely ( mas_is_err ( & vmi - > mas ) ) )
return - ENOMEM ;
return 0 ;
}
2022-09-06 22:48:50 +03:00
/*
* check_brk_limits ( ) - Use platform specific check of range & verify mlock
* limits .
* @ addr : The address to check
* @ len : The size of increase .
*
* Return : 0 on success .
*/
static int check_brk_limits ( unsigned long addr , unsigned long len )
{
unsigned long mapped_addr ;
mapped_addr = get_unmapped_area ( NULL , addr , len , 0 , MAP_FIXED ) ;
if ( IS_ERR_VALUE ( mapped_addr ) )
return mapped_addr ;
2023-05-22 23:52:10 +03:00
return mlock_future_ok ( current - > mm , current - > mm - > def_flags , len )
2023-05-22 11:24:12 +03:00
? 0 : - EAGAIN ;
2022-09-06 22:48:50 +03:00
}
2023-01-20 19:26:09 +03:00
static int do_brk_flags ( struct vma_iterator * vmi , struct vm_area_struct * brkvma ,
2022-09-06 22:49:06 +03:00
unsigned long addr , unsigned long request , unsigned long flags ) ;
2009-01-14 16:14:15 +03:00
SYSCALL_DEFINE1 ( brk , unsigned long , brk )
2005-04-17 02:20:36 +04:00
{
2018-10-27 01:08:54 +03:00
unsigned long newbrk , oldbrk , origbrk ;
2005-04-17 02:20:36 +04:00
struct mm_struct * mm = current - > mm ;
2022-09-06 22:48:50 +03:00
struct vm_area_struct * brkvma , * next = NULL ;
2008-06-06 09:46:05 +04:00
unsigned long min_brk ;
2023-06-30 05:28:16 +03:00
bool populate = false ;
2017-02-25 01:58:22 +03:00
LIST_HEAD ( uf ) ;
2023-01-20 19:26:09 +03:00
struct vma_iterator vmi ;
2005-04-17 02:20:36 +04:00
2020-06-09 07:33:25 +03:00
if ( mmap_write_lock_killable ( mm ) )
2016-05-24 02:25:27 +03:00
return - EINTR ;
2005-04-17 02:20:36 +04:00
2018-10-27 01:08:54 +03:00
origbrk = mm - > brk ;
2008-06-06 09:46:05 +04:00
# ifdef CONFIG_COMPAT_BRK
2011-01-14 02:47:23 +03:00
/*
* CONFIG_COMPAT_BRK can still be overridden by setting
* randomize_va_space to 2 , which will still cause mm - > start_brk
* to be arbitrarily shifted
*/
2011-04-15 02:22:09 +04:00
if ( current - > brk_randomized )
2011-01-14 02:47:23 +03:00
min_brk = mm - > start_brk ;
else
min_brk = mm - > end_data ;
2008-06-06 09:46:05 +04:00
# else
min_brk = mm - > start_brk ;
# endif
if ( brk < min_brk )
2005-04-17 02:20:36 +04:00
goto out ;
2006-04-11 09:52:57 +04:00
/*
* Check against rlimit here . If this check is done later after the test
* of oldbrk with newbrk then it can escape the test and let the data
* segment grow beyond its set limit the in case where the limit is
* not page aligned - Ram Gupta
*/
2014-10-10 02:27:32 +04:00
if ( check_data_rlimit ( rlimit ( RLIMIT_DATA ) , brk , mm - > start_brk ,
mm - > end_data , mm - > start_data ) )
2006-04-11 09:52:57 +04:00
goto out ;
2005-04-17 02:20:36 +04:00
newbrk = PAGE_ALIGN ( brk ) ;
oldbrk = PAGE_ALIGN ( mm - > brk ) ;
2018-10-27 01:08:54 +03:00
if ( oldbrk = = newbrk ) {
mm - > brk = brk ;
goto success ;
}
2005-04-17 02:20:36 +04:00
2023-06-30 05:28:16 +03:00
/* Always allow shrinking brk. */
2005-04-17 02:20:36 +04:00
if ( brk < = mm - > brk ) {
2022-09-06 22:48:50 +03:00
/* Search one past newbrk */
2023-01-20 19:26:09 +03:00
vma_iter_init ( & vmi , mm , newbrk ) ;
brkvma = vma_find ( & vmi , oldbrk ) ;
mm: do not BUG_ON missing brk mapping, because userspace can unmap it
The following program will trigger the BUG_ON that this patch removes,
because the user can munmap() mm->brk:
#include <sys/syscall.h>
#include <sys/mman.h>
#include <assert.h>
#include <unistd.h>
static void *brk_now(void)
{
return (void *)syscall(SYS_brk, 0);
}
static void brk_set(void *b)
{
assert(syscall(SYS_brk, b) != -1);
}
int main(int argc, char *argv[])
{
void *b = brk_now();
brk_set(b + 4096);
assert(munmap(b - 4096, 4096 * 2) == 0);
brk_set(b);
return 0;
}
Compile that with musl, since glibc actually uses brk(), and then
execute it, and it'll hit this splat:
kernel BUG at mm/mmap.c:229!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 12 PID: 1379 Comm: a.out Tainted: G S U 6.1.0-rc7+ #419
RIP: 0010:__do_sys_brk+0x2fc/0x340
Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20>
RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000
RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08
RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000
R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000
R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000
FS: 0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x400678
Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b>
RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c
RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678
RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000
RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000
R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978
R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Instead, just return the old brk value if the original mapping has been
removed.
[akpm@linux-foundation.org: fix changelog, per Liam]
Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com
Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-02 19:27:24 +03:00
if ( ! brkvma | | brkvma - > vm_start > = oldbrk )
2022-09-06 22:48:50 +03:00
goto out ; /* mapping intersects with an existing non-brk vma. */
2018-10-27 01:08:54 +03:00
/*
2022-09-06 22:48:50 +03:00
* mm - > brk must be protected by write mmap_lock .
2023-06-30 05:28:16 +03:00
* do_vma_munmap ( ) will drop the lock on success , so update it
2023-01-27 00:20:49 +03:00
* before calling do_vma_munmap ( ) .
2018-10-27 01:08:54 +03:00
*/
mm - > brk = brk ;
2023-06-30 05:28:16 +03:00
if ( do_vma_munmap ( & vmi , brkvma , newbrk , oldbrk , & uf , true ) )
goto out ;
goto success_unlocked ;
2005-04-17 02:20:36 +04:00
}
2022-09-06 22:48:50 +03:00
if ( check_brk_limits ( oldbrk , newbrk - oldbrk ) )
goto out ;
/*
* Only check if the next VMA is within the stack_guard_gap of the
* expansion area
*/
2023-01-20 19:26:09 +03:00
vma_iter_init ( & vmi , mm , oldbrk ) ;
next = vma_find ( & vmi , newbrk + PAGE_SIZE + stack_guard_gap ) ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
if ( next & & newbrk + PAGE_SIZE > vm_start_gap ( next ) )
2005-04-17 02:20:36 +04:00
goto out ;
2023-01-20 19:26:09 +03:00
brkvma = vma_prev_limit ( & vmi , mm - > start_brk ) ;
2005-04-17 02:20:36 +04:00
/* Ok, looks good - let it rip. */
2023-01-20 19:26:09 +03:00
if ( do_brk_flags ( & vmi , brkvma , oldbrk , newbrk - oldbrk , 0 ) < 0 )
2005-04-17 02:20:36 +04:00
goto out ;
2022-09-06 22:48:50 +03:00
2005-04-17 02:20:36 +04:00
mm - > brk = brk ;
2023-06-30 05:28:16 +03:00
if ( mm - > def_flags & VM_LOCKED )
populate = true ;
2018-10-27 01:08:54 +03:00
success :
2023-06-30 05:28:16 +03:00
mmap_write_unlock ( mm ) ;
success_unlocked :
2017-02-25 01:58:22 +03:00
userfaultfd_unmap_complete ( mm , & uf ) ;
2013-02-23 04:32:40 +04:00
if ( populate )
mm_populate ( oldbrk , newbrk - oldbrk ) ;
return brk ;
2005-04-17 02:20:36 +04:00
out :
2023-06-30 05:28:16 +03:00
mm - > brk = origbrk ;
2020-06-09 07:33:25 +03:00
mmap_write_unlock ( mm ) ;
2021-02-24 23:04:29 +03:00
return origbrk ;
2005-04-17 02:20:36 +04:00
}
2022-09-06 22:48:45 +03:00
# if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
2014-04-04 01:48:03 +04:00
static void validate_mm ( struct mm_struct * mm )
2005-04-17 02:20:36 +04:00
{
int bug = 0 ;
int i = 0 ;
2022-09-06 22:49:06 +03:00
struct vm_area_struct * vma ;
2023-05-18 17:55:26 +03:00
VMA_ITERATOR ( vmi , mm , 0 ) ;
2022-09-06 22:48:48 +03:00
2023-05-18 17:55:26 +03:00
mt_validate ( & mm - > mm_mt ) ;
for_each_vma ( vmi , vma ) {
2022-09-06 22:48:48 +03:00
# ifdef CONFIG_DEBUG_VM_RB
2016-02-06 02:36:50 +03:00
struct anon_vma * anon_vma = vma - > anon_vma ;
2012-10-09 03:31:45 +04:00
struct anon_vma_chain * avc ;
2023-05-18 17:55:26 +03:00
# endif
unsigned long vmi_start , vmi_end ;
bool warn = 0 ;
2014-10-10 02:28:19 +04:00
2023-05-18 17:55:26 +03:00
vmi_start = vma_iter_addr ( & vmi ) ;
vmi_end = vma_iter_end ( & vmi ) ;
if ( VM_WARN_ON_ONCE_MM ( vma - > vm_end ! = vmi_end , mm ) )
warn = 1 ;
2014-10-10 02:28:19 +04:00
2023-05-18 17:55:26 +03:00
if ( VM_WARN_ON_ONCE_MM ( vma - > vm_start ! = vmi_start , mm ) )
warn = 1 ;
if ( warn ) {
pr_emerg ( " issue in %s \n " , current - > comm ) ;
dump_stack ( ) ;
dump_vma ( vma ) ;
pr_emerg ( " tree range: %px start %lx end %lx \n " , vma ,
vmi_start , vmi_end - 1 ) ;
vma_iter_dump_tree ( & vmi ) ;
}
# ifdef CONFIG_DEBUG_VM_RB
2016-02-06 02:36:50 +03:00
if ( anon_vma ) {
anon_vma_lock_read ( anon_vma ) ;
list_for_each_entry ( avc , & vma - > anon_vma_chain , same_vma )
anon_vma_interval_tree_verify ( avc ) ;
anon_vma_unlock_read ( anon_vma ) ;
}
2022-09-06 22:48:48 +03:00
# endif
2005-04-17 02:20:36 +04:00
i + + ;
}
2012-12-12 04:01:42 +04:00
if ( i ! = mm - > map_count ) {
2023-05-18 17:55:26 +03:00
pr_emerg ( " map_count %d vma iterator %d \n " , mm - > map_count , i ) ;
2012-12-12 04:01:42 +04:00
bug = 1 ;
}
2014-10-10 02:28:39 +04:00
VM_BUG_ON_MM ( bug , mm ) ;
2005-04-17 02:20:36 +04:00
}
2022-09-06 22:48:48 +03:00
# else /* !CONFIG_DEBUG_VM_MAPLE_TREE */
2005-04-17 02:20:36 +04:00
# define validate_mm(mm) do { } while (0)
2022-09-06 22:48:48 +03:00
# endif /* CONFIG_DEBUG_VM_MAPLE_TREE */
2016-10-08 03:01:37 +03:00
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
/*
* vma has some anon_vma assigned , and is already inserted on that
* anon_vma ' s interval trees .
*
* Before updating the vma ' s vm_start / vm_end / vm_pgoff fields , the
* vma must be removed from the anon_vma ' s interval trees using
* anon_vma_interval_tree_pre_update_vma ( ) .
*
* After the update , the vma will be reinserted using
* anon_vma_interval_tree_post_update_vma ( ) .
*
2020-06-09 07:33:54 +03:00
* The entire update must be protected by exclusive mmap_lock and by
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
* the root anon_vma ' s mutex .
*/
static inline void
anon_vma_interval_tree_pre_update_vma ( struct vm_area_struct * vma )
{
struct anon_vma_chain * avc ;
list_for_each_entry ( avc , & vma - > anon_vma_chain , same_vma )
anon_vma_interval_tree_remove ( avc , & avc - > anon_vma - > rb_root ) ;
}
static inline void
anon_vma_interval_tree_post_update_vma ( struct vm_area_struct * vma )
{
struct anon_vma_chain * avc ;
list_for_each_entry ( avc , & vma - > anon_vma_chain , same_vma )
anon_vma_interval_tree_insert ( avc , & avc - > anon_vma - > rb_root ) ;
}
2013-04-30 02:08:33 +04:00
static unsigned long count_vma_pages_range ( struct mm_struct * mm ,
unsigned long addr , unsigned long end )
{
2022-09-06 22:48:46 +03:00
VMA_ITERATOR ( vmi , mm , addr ) ;
2013-04-30 02:08:33 +04:00
struct vm_area_struct * vma ;
2022-09-06 22:48:46 +03:00
unsigned long nr_pages = 0 ;
2013-04-30 02:08:33 +04:00
2022-09-06 22:48:46 +03:00
for_each_vma_range ( vmi , vma , end ) {
unsigned long vm_start = max ( addr , vma - > vm_start ) ;
unsigned long vm_end = min ( end , vma - > vm_end ) ;
2013-04-30 02:08:33 +04:00
2022-09-06 22:48:46 +03:00
nr_pages + = PHYS_PFN ( vm_end - vm_start ) ;
2013-04-30 02:08:33 +04:00
}
return nr_pages ;
}
2022-09-06 22:49:06 +03:00
static void __vma_link_file ( struct vm_area_struct * vma ,
struct address_space * mapping )
2005-04-17 02:20:36 +04:00
{
2022-09-06 22:49:06 +03:00
if ( vma - > vm_flags & VM_SHARED )
mapping_allow_writable ( mapping ) ;
2005-04-17 02:20:36 +04:00
2022-09-06 22:49:06 +03:00
flush_dcache_mmap_lock ( mapping ) ;
vma_interval_tree_insert ( vma , & mapping - > i_mmap ) ;
flush_dcache_mmap_unlock ( mapping ) ;
2005-04-17 02:20:36 +04:00
}
2022-09-06 22:49:06 +03:00
static int vma_link ( struct mm_struct * mm , struct vm_area_struct * vma )
2005-04-17 02:20:36 +04:00
{
2023-01-20 19:26:11 +03:00
VMA_ITERATOR ( vmi , mm , 0 ) ;
2005-04-17 02:20:36 +04:00
struct address_space * mapping = NULL ;
2023-01-20 19:26:11 +03:00
if ( vma_iter_prealloc ( & vmi ) )
2022-09-06 22:48:45 +03:00
return - ENOMEM ;
mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock
UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. When running multiple parallel
tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
come from load_elf_binary through the call chain
"execl->do_execveat_common->bprm_execve->load_elf_binary".
In do_mmap,it will call mmap_region to create vma node, initialize it and
insert it to vma maintain structure in mm_struct and i_mmap tree of the
mapping file, then increase map_count to record the number of vma nodes
used. The hot osq_lock is to protect operations on file's i_mmap tree.
For the mm_struct member change like vma insertion and map_count update,
they do not affect i_mmap tree. Move those operations out of the lock's
critical section, to reduce hold time on the lock.
With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
obvious performance gain on v6.5-rc1 due to regression of this benchmark
from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
mm's rss stats into percpu_counter). Related discussion and conclusion
can be referred at the mail thread initiated by 0day as below: Link:
https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-7e8fb108def0@intel.com/
Link: https://lkml.kernel.org/r/20230712145739.604215-1-yu.ma@intel.com
Signed-off-by: Yu Ma <yu.ma@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhu, Lipeng <lipeng.zhu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 17:57:39 +03:00
vma_iter_store ( & vmi , vma ) ;
2014-06-05 03:07:33 +04:00
if ( vma - > vm_file ) {
2005-04-17 02:20:36 +04:00
mapping = vma - > vm_file - > f_mapping ;
2014-12-13 03:54:21 +03:00
i_mmap_lock_write ( mapping ) ;
2022-09-06 22:49:06 +03:00
__vma_link_file ( vma , mapping ) ;
2014-12-13 03:54:21 +03:00
i_mmap_unlock_write ( mapping ) ;
2022-09-06 22:49:06 +03:00
}
2005-04-17 02:20:36 +04:00
mm - > map_count + + ;
validate_mm ( mm ) ;
2022-09-06 22:48:45 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2023-01-20 19:26:43 +03:00
/*
* init_multi_vma_prep ( ) - Initializer for struct vma_prepare
* @ vp : The vma_prepare struct
* @ vma : The vma that will be altered once locked
* @ next : The next vma if it is to be adjusted
* @ remove : The first vma to be removed
* @ remove2 : The second vma to be removed
*/
static inline void init_multi_vma_prep ( struct vma_prepare * vp ,
struct vm_area_struct * vma , struct vm_area_struct * next ,
struct vm_area_struct * remove , struct vm_area_struct * remove2 )
{
memset ( vp , 0 , sizeof ( struct vma_prepare ) ) ;
vp - > vma = vma ;
vp - > anon_vma = vma - > anon_vma ;
vp - > remove = remove ;
vp - > remove2 = remove2 ;
vp - > adj_next = next ;
if ( ! vp - > anon_vma & & next )
vp - > anon_vma = next - > anon_vma ;
vp - > file = vma - > vm_file ;
if ( vp - > file )
vp - > mapping = vma - > vm_file - > f_mapping ;
}
/*
* init_vma_prep ( ) - Initializer wrapper for vma_prepare struct
* @ vp : The vma_prepare struct
* @ vma : The vma that will be altered once locked
*/
static inline void init_vma_prep ( struct vma_prepare * vp ,
struct vm_area_struct * vma )
{
init_multi_vma_prep ( vp , vma , NULL , NULL , NULL ) ;
}
2023-01-20 19:26:41 +03:00
/*
* vma_prepare ( ) - Helper function for handling locking VMAs prior to altering
* @ vp : The initialized vma_prepare struct
*/
static inline void vma_prepare ( struct vma_prepare * vp )
{
2023-02-27 20:36:15 +03:00
vma_start_write ( vp - > vma ) ;
if ( vp - > adj_next )
vma_start_write ( vp - > adj_next ) ;
/* vp->insert is always a newly created VMA, no need for locking */
if ( vp - > remove )
vma_start_write ( vp - > remove ) ;
if ( vp - > remove2 )
vma_start_write ( vp - > remove2 ) ;
2023-01-20 19:26:41 +03:00
if ( vp - > file ) {
uprobe_munmap ( vp - > vma , vp - > vma - > vm_start , vp - > vma - > vm_end ) ;
if ( vp - > adj_next )
uprobe_munmap ( vp - > adj_next , vp - > adj_next - > vm_start ,
vp - > adj_next - > vm_end ) ;
i_mmap_lock_write ( vp - > mapping ) ;
if ( vp - > insert & & vp - > insert - > vm_file ) {
/*
* Put into interval tree now , so instantiated pages
* are visible to arm / parisc __flush_dcache_page
* throughout ; but we cannot insert into address
* space until vma start or end is updated .
*/
__vma_link_file ( vp - > insert ,
vp - > insert - > vm_file - > f_mapping ) ;
}
}
if ( vp - > anon_vma ) {
anon_vma_lock_write ( vp - > anon_vma ) ;
anon_vma_interval_tree_pre_update_vma ( vp - > vma ) ;
if ( vp - > adj_next )
anon_vma_interval_tree_pre_update_vma ( vp - > adj_next ) ;
}
if ( vp - > file ) {
flush_dcache_mmap_lock ( vp - > mapping ) ;
vma_interval_tree_remove ( vp - > vma , & vp - > mapping - > i_mmap ) ;
if ( vp - > adj_next )
vma_interval_tree_remove ( vp - > adj_next ,
& vp - > mapping - > i_mmap ) ;
}
}
/*
* vma_complete - Helper function for handling the unlocking after altering VMAs ,
* or for inserting a VMA .
*
* @ vp : The vma_prepare struct
* @ vmi : The vma iterator
* @ mm : The mm_struct
*/
static inline void vma_complete ( struct vma_prepare * vp ,
struct vma_iterator * vmi , struct mm_struct * mm )
{
if ( vp - > file ) {
if ( vp - > adj_next )
vma_interval_tree_insert ( vp - > adj_next ,
& vp - > mapping - > i_mmap ) ;
vma_interval_tree_insert ( vp - > vma , & vp - > mapping - > i_mmap ) ;
flush_dcache_mmap_unlock ( vp - > mapping ) ;
}
if ( vp - > remove & & vp - > file ) {
__remove_shared_vm_struct ( vp - > remove , vp - > file , vp - > mapping ) ;
if ( vp - > remove2 )
__remove_shared_vm_struct ( vp - > remove2 , vp - > file ,
vp - > mapping ) ;
} else if ( vp - > insert ) {
/*
* split_vma has split insert from vma , and needs
* us to insert it before dropping the locks
* ( it may either follow vma or precede it ) .
*/
vma_iter_store ( vmi , vp - > insert ) ;
mm - > map_count + + ;
}
if ( vp - > anon_vma ) {
anon_vma_interval_tree_post_update_vma ( vp - > vma ) ;
if ( vp - > adj_next )
anon_vma_interval_tree_post_update_vma ( vp - > adj_next ) ;
anon_vma_unlock_write ( vp - > anon_vma ) ;
}
if ( vp - > file ) {
i_mmap_unlock_write ( vp - > mapping ) ;
uprobe_mmap ( vp - > vma ) ;
if ( vp - > adj_next )
uprobe_mmap ( vp - > adj_next ) ;
}
if ( vp - > remove ) {
again :
2023-02-27 20:36:21 +03:00
vma_mark_detached ( vp - > remove , true ) ;
2023-01-20 19:26:41 +03:00
if ( vp - > file ) {
uprobe_munmap ( vp - > remove , vp - > remove - > vm_start ,
vp - > remove - > vm_end ) ;
fput ( vp - > file ) ;
}
if ( vp - > remove - > anon_vma )
anon_vma_merge ( vp - > vma , vp - > remove ) ;
mm - > map_count - - ;
mpol_put ( vma_policy ( vp - > remove ) ) ;
if ( ! vp - > remove2 )
WARN_ON_ONCE ( vp - > vma - > vm_end < vp - > remove - > vm_end ) ;
vm_area_free ( vp - > remove ) ;
/*
* In mprotect ' s case 6 ( see comments on vma_merge ) ,
2023-03-09 14:12:51 +03:00
* we are removing both mid and next vmas
2023-01-20 19:26:41 +03:00
*/
if ( vp - > remove2 ) {
vp - > remove = vp - > remove2 ;
vp - > remove2 = NULL ;
goto again ;
}
}
if ( vp - > insert & & vp - > file )
uprobe_mmap ( vp - > insert ) ;
2023-07-14 22:55:48 +03:00
validate_mm ( mm ) ;
2023-01-20 19:26:41 +03:00
}
2023-01-20 19:26:47 +03:00
/*
* dup_anon_vma ( ) - Helper function to duplicate anon_vma
* @ dst : The destination VMA
* @ src : The source VMA
*
* Returns : 0 on success .
*/
static inline int dup_anon_vma ( struct vm_area_struct * dst ,
struct vm_area_struct * src )
{
/*
* Easily overlooked : when mprotect shifts the boundary , make sure the
* expanding vma has anon_vma set if the shrinking vma had , to cover any
* anon pages imported .
*/
if ( src - > anon_vma & & ! dst - > anon_vma ) {
mm: lock VMA in dup_anon_vma() before setting ->anon_vma
When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
VMA that is being expanded to cover the area previously occupied by
another VMA. This currently happens while `dst` is not write-locked.
This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
page faults can happen on `dst` under the per-VMA lock. This is already
icky in itself, since such page faults can now install pages into `dst`
that are attached to an `anon_vma` that is not yet tied back to the
`anon_vma` with an `anon_vma_chain`. But if `anon_vma_clone()` fails due
to an out-of-memory error, things get much worse: `anon_vma_clone()` then
reverts `dst->anon_vma` back to NULL, and `dst` remains completely
unconnected to the `anon_vma`, even though we can have pages in the area
covered by `dst` that point to the `anon_vma`.
This means the `anon_vma` of such pages can be freed while the pages are
still mapped into userspace, which leads to UAF when a helper like
folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.
This theoretically is a security bug, but I believe it is really hard to
actually trigger as an unprivileged user because it requires that you can
make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
pretty hard to prevent that.
I think doing the vma_start_write() call inside dup_anon_vma() is the most
straightforward fix for now.
For a kernel-assisted reproducer, see the notes section of the patch mail.
Link: https://lkml.kernel.org/r/20230721034643.616851-1-jannh@google.com
Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-21 06:46:43 +03:00
vma_start_write ( dst ) ;
2023-01-20 19:26:47 +03:00
dst - > anon_vma = src - > anon_vma ;
return anon_vma_clone ( dst , src ) ;
}
return 0 ;
}
2023-01-20 19:26:42 +03:00
/*
* vma_expand - Expand an existing VMA
*
* @ vmi : The vma iterator
* @ vma : The vma to expand
* @ start : The start of the vma
* @ end : The exclusive end of the vma
* @ pgoff : The page offset of vma
* @ next : The current of next vma .
*
* Expand @ vma to @ start and @ end . Can expand off the start and end . Will
* expand over @ next if it ' s different from @ vma and @ end = = @ next - > vm_end .
* Checking if the @ vma can expand and merge with @ next needs to be handled by
* the caller .
*
* Returns : 0 on success
*/
2023-01-20 19:26:45 +03:00
int vma_expand ( struct vma_iterator * vmi , struct vm_area_struct * vma ,
unsigned long start , unsigned long end , pgoff_t pgoff ,
struct vm_area_struct * next )
2023-01-20 19:26:42 +03:00
{
2023-01-20 19:26:43 +03:00
bool remove_next = false ;
2023-01-20 19:26:42 +03:00
struct vma_prepare vp ;
if ( next & & ( vma ! = next ) & & ( end = = next - > vm_end ) ) {
2023-01-20 19:26:47 +03:00
int ret ;
2023-01-20 19:26:42 +03:00
2023-01-20 19:26:47 +03:00
remove_next = true ;
ret = dup_anon_vma ( vma , next ) ;
if ( ret )
return ret ;
2023-01-20 19:26:42 +03:00
}
2023-01-20 19:26:43 +03:00
init_multi_vma_prep ( & vp , vma , NULL , remove_next ? next : NULL , NULL ) ;
2023-01-20 19:26:42 +03:00
/* Not merging but overwriting any part of next is not handled. */
VM_WARN_ON ( next & & ! vp . remove & &
next ! = vma & & end > next - > vm_start ) ;
/* Only handles expanding */
VM_WARN_ON ( vma - > vm_start < start | | vma - > vm_end > end ) ;
if ( vma_iter_prealloc ( vmi ) )
goto nomem ;
2023-02-27 20:36:13 +03:00
vma_prepare ( & vp ) ;
2023-01-20 19:26:42 +03:00
vma_adjust_trans_huge ( vma , start , end , 0 ) ;
/* VMA iterator points to previous, so set to start if necessary */
if ( vma_iter_addr ( vmi ) ! = start )
vma_iter_set ( vmi , start ) ;
vma - > vm_start = start ;
vma - > vm_end = end ;
vma - > vm_pgoff = pgoff ;
/* Note: mas must be pointing to the expanding VMA */
vma_iter_store ( vmi , vma ) ;
vma_complete ( & vp , vmi , vma - > vm_mm ) ;
return 0 ;
nomem :
return - ENOMEM ;
}
2023-01-20 19:26:46 +03:00
/*
* vma_shrink ( ) - Reduce an existing VMAs memory area
* @ vmi : The vma iterator
* @ vma : The VMA to modify
* @ start : The new start
* @ end : The new end
*
* Returns : 0 on success , - ENOMEM otherwise
*/
int vma_shrink ( struct vma_iterator * vmi , struct vm_area_struct * vma ,
unsigned long start , unsigned long end , pgoff_t pgoff )
{
struct vma_prepare vp ;
WARN_ON ( ( vma - > vm_start ! = start ) & & ( vma - > vm_end ! = end ) ) ;
if ( vma_iter_prealloc ( vmi ) )
return - ENOMEM ;
init_vma_prep ( & vp , vma ) ;
vma_prepare ( & vp ) ;
2023-02-27 20:36:13 +03:00
vma_adjust_trans_huge ( vma , start , end , 0 ) ;
2023-01-20 19:26:46 +03:00
if ( vma - > vm_start < start )
vma_iter_clear ( vmi , vma - > vm_start , start ) ;
if ( vma - > vm_end > end )
vma_iter_clear ( vmi , end , vma - > vm_end ) ;
vma - > vm_start = start ;
vma - > vm_end = end ;
vma - > vm_pgoff = pgoff ;
vma_complete ( & vp , vmi , vma - > vm_mm ) ;
return 0 ;
}
2005-04-17 02:20:36 +04:00
/*
* If the vma has a - > close operation then the driver probably needs to release
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c35 as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-09 14:12:57 +03:00
* per - vma resources , so we don ' t attempt to merge those if the caller indicates
* the current vma may be removed as part of the merge .
2005-04-17 02:20:36 +04:00
*/
2023-03-09 14:12:56 +03:00
static inline bool is_mergeable_vma ( struct vm_area_struct * vma ,
struct file * file , unsigned long vm_flags ,
struct vm_userfaultfd_ctx vm_userfaultfd_ctx ,
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c35 as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-09 14:12:57 +03:00
struct anon_vma_name * anon_name , bool may_remove_vma )
2005-04-17 02:20:36 +04:00
{
2014-01-24 03:53:42 +04:00
/*
* VM_SOFTDIRTY should not prevent from VMA merging , if we
* match the flags but dirty bit - - the caller should mark
* merged VMA as dirty . If dirty bit won ' t be excluded from
2019-03-06 02:46:22 +03:00
* comparison , we increase pressure on the memory system forcing
2014-01-24 03:53:42 +04:00
* the kernel to generate new VMAs when old one could be
* extended instead .
*/
if ( ( vma - > vm_flags ^ vm_flags ) & ~ VM_SOFTDIRTY )
2023-03-09 14:12:56 +03:00
return false ;
2005-04-17 02:20:36 +04:00
if ( vma - > vm_file ! = file )
2023-03-09 14:12:56 +03:00
return false ;
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c35 as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-09 14:12:57 +03:00
if ( may_remove_vma & & vma - > vm_ops & & vma - > vm_ops - > close )
2023-03-09 14:12:56 +03:00
return false ;
2015-09-05 01:46:24 +03:00
if ( ! is_mergeable_vm_userfaultfd_ctx ( vma , vm_userfaultfd_ctx ) )
2023-03-09 14:12:56 +03:00
return false ;
2022-03-05 07:28:51 +03:00
if ( ! anon_vma_name_eq ( anon_vma_name ( vma ) , anon_name ) )
2023-03-09 14:12:56 +03:00
return false ;
return true ;
2005-04-17 02:20:36 +04:00
}
2023-03-09 14:12:56 +03:00
static inline bool is_mergeable_anon_vma ( struct anon_vma * anon_vma1 ,
struct anon_vma * anon_vma2 , struct vm_area_struct * vma )
2005-04-17 02:20:36 +04:00
{
2011-05-25 04:11:20 +04:00
/*
* The list_is_singular ( ) test is to avoid merging VMA cloned from
* parents . This can improve scalability caused by anon_vma lock .
*/
if ( ( ! anon_vma1 | | ! anon_vma2 ) & & ( ! vma | |
list_is_singular ( & vma - > anon_vma_chain ) ) )
2023-03-09 14:12:56 +03:00
return true ;
2011-05-25 04:11:20 +04:00
return anon_vma1 = = anon_vma2 ;
2005-04-17 02:20:36 +04:00
}
/*
* Return true if we can merge this ( vm_flags , anon_vma , file , vm_pgoff )
* in front of ( at a lower virtual address and file offset than ) the vma .
*
* We cannot merge two vmas if they have differently assigned ( non - NULL )
* anon_vmas , nor if same anon_vma is assigned but offsets incompatible .
*
* We don ' t check here for the merged mmap wrapping around the end of pagecache
2020-08-07 09:23:37 +03:00
* indices ( 16 TB on ia32 ) because do_mmap ( ) does not permit mmap ' s which
2005-04-17 02:20:36 +04:00
* wrap , nor mmaps which cover the final page at index - 1UL .
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c35 as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-09 14:12:57 +03:00
*
* We assume the vma may be removed as part of the merge .
2005-04-17 02:20:36 +04:00
*/
2023-03-09 14:12:56 +03:00
static bool
2005-04-17 02:20:36 +04:00
can_vma_merge_before ( struct vm_area_struct * vma , unsigned long vm_flags ,
2023-03-09 14:12:56 +03:00
struct anon_vma * anon_vma , struct file * file ,
pgoff_t vm_pgoff , struct vm_userfaultfd_ctx vm_userfaultfd_ctx ,
struct anon_vma_name * anon_name )
2005-04-17 02:20:36 +04:00
{
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c35 as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-09 14:12:57 +03:00
if ( is_mergeable_vma ( vma , file , vm_flags , vm_userfaultfd_ctx , anon_name , true ) & &
2011-05-25 04:11:20 +04:00
is_mergeable_anon_vma ( anon_vma , vma - > anon_vma , vma ) ) {
2005-04-17 02:20:36 +04:00
if ( vma - > vm_pgoff = = vm_pgoff )
2023-03-09 14:12:56 +03:00
return true ;
2005-04-17 02:20:36 +04:00
}
2023-03-09 14:12:56 +03:00
return false ;
2005-04-17 02:20:36 +04:00
}
/*
* Return true if we can merge this ( vm_flags , anon_vma , file , vm_pgoff )
* beyond ( at a higher virtual address and file offset than ) the vma .
*
* We cannot merge two vmas if they have differently assigned ( non - NULL )
* anon_vmas , nor if same anon_vma is assigned but offsets incompatible .
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c35 as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-09 14:12:57 +03:00
*
* We assume that vma is not removed as part of the merge .
2005-04-17 02:20:36 +04:00
*/
2023-03-09 14:12:56 +03:00
static bool
2005-04-17 02:20:36 +04:00
can_vma_merge_after ( struct vm_area_struct * vma , unsigned long vm_flags ,
2023-03-09 14:12:56 +03:00
struct anon_vma * anon_vma , struct file * file ,
pgoff_t vm_pgoff , struct vm_userfaultfd_ctx vm_userfaultfd_ctx ,
struct anon_vma_name * anon_name )
2005-04-17 02:20:36 +04:00
{
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit
d014cd7c1c35 as the next step to simplify mremap code again.
[vbabka@suse.cz: adjust comment as suggested by Lorenzo]
Link: https://lkml.kernel.org/r/74f2ea6c-f1a9-6dd7-260c-25e660f42379@suse.cz
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-09 14:12:57 +03:00
if ( is_mergeable_vma ( vma , file , vm_flags , vm_userfaultfd_ctx , anon_name , false ) & &
2011-05-25 04:11:20 +04:00
is_mergeable_anon_vma ( anon_vma , vma - > anon_vma , vma ) ) {
2005-04-17 02:20:36 +04:00
pgoff_t vm_pglen ;
2013-07-04 02:01:26 +04:00
vm_pglen = vma_pages ( vma ) ;
2005-04-17 02:20:36 +04:00
if ( vma - > vm_pgoff + vm_pglen = = vm_pgoff )
2023-03-09 14:12:56 +03:00
return true ;
2005-04-17 02:20:36 +04:00
}
2023-03-09 14:12:56 +03:00
return false ;
2005-04-17 02:20:36 +04:00
}
/*
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 01:05:59 +03:00
* Given a mapping request ( addr , end , vm_flags , file , pgoff , anon_name ) ,
* figure out whether that can be merged with its predecessor or its
* successor . Or both ( it neatly fills a hole ) .
2005-04-17 02:20:36 +04:00
*
* In most cases - when called for mmap , brk or mremap - [ addr , end ) is
* certain not to be mapped by the time vma_merge is called ; but when
* called for mprotect , it is certain to be already mapped ( either at
* an offset within prev , or at the start of next ) , and the flags of
* this area are about to be changed to vm_flags - and the no - change
* case has already been eliminated .
*
2023-03-21 23:45:55 +03:00
* The following mprotect cases have to be considered , where * * * * is
2005-04-17 02:20:36 +04:00
* the area passed down from mprotect_fixup , never extending beyond one
2023-03-21 23:45:55 +03:00
* vma , PPPP is the previous vma , CCCC is a concurrent vma that starts
* at the same address as * * * * and is of the same or larger span , and
* NNNN the next vma after * * * * :
2005-04-17 02:20:36 +04:00
*
2023-03-21 23:45:55 +03:00
* * * * * * * * * * * * *
* PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPCCCCCC
2019-12-01 04:57:39 +03:00
* cannot merge might become might become
2023-03-21 23:45:55 +03:00
* PPNNNNNNNNNN PPPPPPPPPPCC
2019-12-01 04:57:39 +03:00
* mmap , brk or case 4 below case 5 below
* mremap move :
2023-03-21 23:45:55 +03:00
* * * * * * * * *
* PPPP NNNN PPPPCCCCNNNN
2019-12-01 04:57:39 +03:00
* might become might become
* PPPPPPPPPPPP 1 or PPPPPPPPPPPP 6 or
2023-03-21 23:45:55 +03:00
* PPPPPPPPNNNN 2 or PPPPPPPPNNNN 7 or
* PPPPNNNNNNNN 3 PPPPNNNNNNNN 8
2005-04-17 02:20:36 +04:00
*
2023-03-21 23:45:55 +03:00
* It is important for case 8 that the vma CCCC overlapping the
* region * * * * is never going to extended over NNNN . Instead NNNN must
* be extended in region * * * * and CCCC must be removed . This way in
2023-01-20 19:26:49 +03:00
* all cases where vma_merge succeeds , the moment vma_merge drops the
mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.
The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.
The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.
migrate mprotect
------------ -------------
migrating in "next" vma
vma_merge() # removes "next" vma and
# extends "current" vma
# current vma is not with
# vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated
This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.
Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.
This fix removes the oddness factor from case 8 and it converts it from:
AAAA
PPPPNNNNXXXX -> PPPPNNNNNNNN
to:
AAAA
PPPPNNNNXXXX -> PPPPXXXXXXXX
XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully. It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).
Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aditya Mandaleeka <adityam@microsoft.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 03:01:28 +03:00
* rmap_locks , the properties of the merged vma will be already
* correct for the whole merged range . Some of those properties like
* vm_page_prot / vm_flags may be accessed by rmap_walks and they must
* be correct for the whole merged range immediately after the
2023-03-21 23:45:55 +03:00
* rmap_locks are released . Otherwise if NNNN would be removed and
* CCCC would be extended over the NNNN range , remove_migration_ptes
mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.
The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.
The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.
migrate mprotect
------------ -------------
migrating in "next" vma
vma_merge() # removes "next" vma and
# extends "current" vma
# current vma is not with
# vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated
This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.
Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.
This fix removes the oddness factor from case 8 and it converts it from:
AAAA
PPPPNNNNXXXX -> PPPPNNNNNNNN
to:
AAAA
PPPPNNNNXXXX -> PPPPXXXXXXXX
XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully. It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).
Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aditya Mandaleeka <adityam@microsoft.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 03:01:28 +03:00
* or other rmap walkers ( if working on addresses beyond the " end "
2023-03-21 23:45:55 +03:00
* parameter ) may establish ptes with the wrong permissions of CCCC
* instead of the right permissions of NNNN .
2023-01-20 19:26:49 +03:00
*
* In the code below :
* PPPP is represented by * prev
2023-03-21 23:45:55 +03:00
* CCCC is represented by * curr or not represented at all ( NULL )
* NNNN is represented by * next or not represented at all ( NULL )
* * * * * is not represented - it will be merged and the vma containing the
2023-03-09 14:12:54 +03:00
* area is returned , or the function will return NULL
2005-04-17 02:20:36 +04:00
*/
2023-01-20 19:26:30 +03:00
struct vm_area_struct * vma_merge ( struct vma_iterator * vmi , struct mm_struct * mm ,
2005-04-17 02:20:36 +04:00
struct vm_area_struct * prev , unsigned long addr ,
unsigned long end , unsigned long vm_flags ,
2014-10-10 02:26:29 +04:00
struct anon_vma * anon_vma , struct file * file ,
2015-09-05 01:46:24 +03:00
pgoff_t pgoff , struct mempolicy * policy ,
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 01:05:59 +03:00
struct vm_userfaultfd_ctx vm_userfaultfd_ctx ,
2022-03-05 07:28:51 +03:00
struct anon_vma_name * anon_name )
2005-04-17 02:20:36 +04:00
{
2023-03-22 23:18:59 +03:00
struct vm_area_struct * curr , * next , * res ;
2023-01-20 19:26:49 +03:00
struct vm_area_struct * vma , * adjust , * remove , * remove2 ;
2023-03-22 23:19:00 +03:00
struct vma_prepare vp ;
pgoff_t vma_pgoff ;
int err = 0 ;
2022-06-03 17:57:18 +03:00
bool merge_prev = false ;
bool merge_next = false ;
2023-01-20 19:26:49 +03:00
bool vma_expanded = false ;
unsigned long vma_start = addr ;
unsigned long vma_end = end ;
2023-03-22 23:19:00 +03:00
pgoff_t pglen = ( end - addr ) > > PAGE_SHIFT ;
2023-03-09 14:12:55 +03:00
long adj_start = 0 ;
2005-04-17 02:20:36 +04:00
/*
* We later require that vma - > vm_flags = = vm_flags ,
* so this tests vma - > vm_flags & VM_SPECIAL , too .
*/
if ( vm_flags & VM_SPECIAL )
return NULL ;
2023-03-22 23:18:58 +03:00
/* Does the input range span an existing VMA? (cases 5 - 8) */
curr = find_vma_intersection ( mm , prev ? prev - > vm_end : 0 , end ) ;
2005-04-17 02:20:36 +04:00
2023-03-22 23:18:58 +03:00
if ( ! curr | | /* cases 1 - 4 */
end = = curr - > vm_end ) /* cases 6 - 8, adjacent VMA */
next = vma_lookup ( mm , end ) ;
else
next = NULL ; /* case 5 */
mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.
The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.
The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.
migrate mprotect
------------ -------------
migrating in "next" vma
vma_merge() # removes "next" vma and
# extends "current" vma
# current vma is not with
# vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated
This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.
Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.
This fix removes the oddness factor from case 8 and it converts it from:
AAAA
PPPPNNNNXXXX -> PPPPNNNNNNNN
to:
AAAA
PPPPNNNNXXXX -> PPPPXXXXXXXX
XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully. It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).
Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aditya Mandaleeka <adityam@microsoft.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 03:01:28 +03:00
2023-01-20 19:26:49 +03:00
if ( prev ) {
vma_start = prev - > vm_start ;
vma_pgoff = prev - > vm_pgoff ;
2023-03-22 23:19:00 +03:00
2023-01-20 19:26:49 +03:00
/* Can we merge the predecessor? */
2023-03-22 23:19:00 +03:00
if ( addr = = prev - > vm_end & & mpol_equal ( vma_policy ( prev ) , policy )
2023-01-20 19:26:49 +03:00
& & can_vma_merge_after ( prev , vm_flags , anon_vma , file ,
2023-03-22 23:19:00 +03:00
pgoff , vm_userfaultfd_ctx , anon_name ) ) {
2023-01-20 19:26:49 +03:00
merge_prev = true ;
2023-01-20 19:26:50 +03:00
vma_prev ( vmi ) ;
2023-01-20 19:26:49 +03:00
}
2005-04-17 02:20:36 +04:00
}
2023-03-22 23:18:59 +03:00
2022-06-03 17:57:18 +03:00
/* Can we merge the successor? */
2023-03-22 23:18:58 +03:00
if ( next & & mpol_equal ( policy , vma_policy ( next ) ) & &
2023-03-22 23:19:00 +03:00
can_vma_merge_before ( next , vm_flags , anon_vma , file , pgoff + pglen ,
2023-03-22 23:18:58 +03:00
vm_userfaultfd_ctx , anon_name ) ) {
2022-06-03 17:57:18 +03:00
merge_next = true ;
}
2023-01-20 19:26:49 +03:00
2023-04-30 23:19:17 +03:00
/* Verify some invariant that must be enforced by the caller. */
VM_WARN_ON ( prev & & addr < = prev - > vm_start ) ;
VM_WARN_ON ( curr & & ( addr ! = curr - > vm_start | | end > curr - > vm_end ) ) ;
VM_WARN_ON ( addr > = end ) ;
2023-03-22 23:19:00 +03:00
if ( ! merge_prev & & ! merge_next )
return NULL ; /* Not mergeable. */
res = vma = prev ;
2023-01-20 19:26:49 +03:00
remove = remove2 = adjust = NULL ;
2023-03-22 23:19:00 +03:00
2022-06-03 17:57:18 +03:00
/* Can we merge both the predecessor and the successor? */
if ( merge_prev & & merge_next & &
2023-01-20 19:26:49 +03:00
is_mergeable_anon_vma ( prev - > anon_vma , next - > anon_vma , NULL ) ) {
2023-03-09 14:12:51 +03:00
remove = next ; /* case 1 */
2023-01-20 19:26:49 +03:00
vma_end = next - > vm_end ;
2023-03-09 14:12:51 +03:00
err = dup_anon_vma ( prev , next ) ;
2023-03-21 23:45:55 +03:00
if ( curr ) { /* case 6 */
remove = curr ;
2023-01-20 19:26:49 +03:00
remove2 = next ;
2023-03-09 14:12:51 +03:00
if ( ! next - > anon_vma )
2023-03-21 23:45:55 +03:00
err = dup_anon_vma ( prev , curr ) ;
2023-01-20 19:26:49 +03:00
}
2023-03-22 23:19:00 +03:00
} else if ( merge_prev ) { /* case 2 */
2023-03-21 23:45:55 +03:00
if ( curr ) {
err = dup_anon_vma ( prev , curr ) ;
if ( end = = curr - > vm_end ) { /* case 7 */
remove = curr ;
2023-01-20 19:26:49 +03:00
} else { /* case 5 */
2023-03-21 23:45:55 +03:00
adjust = curr ;
adj_start = ( end - curr - > vm_start ) ;
2023-01-20 19:26:49 +03:00
}
}
2023-03-22 23:19:00 +03:00
} else { /* merge_next */
2022-06-03 17:57:18 +03:00
res = next ;
2023-01-20 19:26:49 +03:00
if ( prev & & addr < prev - > vm_end ) { /* case 4 */
vma_end = addr ;
2023-03-09 14:12:52 +03:00
adjust = next ;
2023-03-09 14:12:55 +03:00
adj_start = - ( prev - > vm_end - addr ) ;
2023-03-09 14:12:52 +03:00
err = dup_anon_vma ( next , prev ) ;
2023-01-20 19:26:49 +03:00
} else {
2023-03-22 23:18:59 +03:00
/*
* Note that cases 3 and 8 are the ONLY ones where prev
* is permitted to be ( but is not necessarily ) NULL .
*/
2023-01-20 19:26:49 +03:00
vma = next ; /* case 3 */
vma_start = addr ;
vma_end = next - > vm_end ;
2023-04-27 17:09:59 +03:00
vma_pgoff = next - > vm_pgoff - pglen ;
2023-03-21 23:45:55 +03:00
if ( curr ) { /* case 8 */
vma_pgoff = curr - > vm_pgoff ;
remove = curr ;
err = dup_anon_vma ( next , curr ) ;
2023-01-20 19:26:49 +03:00
}
}
2005-04-17 02:20:36 +04:00
}
2023-03-22 23:19:00 +03:00
/* Error in anon_vma clone. */
2022-06-03 17:57:18 +03:00
if ( err )
return NULL ;
2023-01-20 19:26:49 +03:00
if ( vma_iter_prealloc ( vmi ) )
return NULL ;
init_multi_vma_prep ( & vp , vma , adjust , remove , remove2 ) ;
VM_WARN_ON ( vp . anon_vma & & adjust & & adjust - > anon_vma & &
vp . anon_vma ! = adjust - > anon_vma ) ;
vma_prepare ( & vp ) ;
2023-02-27 20:36:13 +03:00
vma_adjust_trans_huge ( vma , vma_start , vma_end , adj_start ) ;
2023-01-20 19:26:49 +03:00
if ( vma_start < vma - > vm_start | | vma_end > vma - > vm_end )
vma_expanded = true ;
vma - > vm_start = vma_start ;
vma - > vm_end = vma_end ;
vma - > vm_pgoff = vma_pgoff ;
if ( vma_expanded )
vma_iter_store ( vmi , vma ) ;
2023-03-09 14:12:55 +03:00
if ( adj_start ) {
adjust - > vm_start + = adj_start ;
adjust - > vm_pgoff + = adj_start > > PAGE_SHIFT ;
if ( adj_start < 0 ) {
2023-01-20 19:26:49 +03:00
WARN_ON ( vma_expanded ) ;
vma_iter_store ( vmi , next ) ;
}
}
vma_complete ( & vp , vmi , mm ) ;
2022-06-03 17:57:18 +03:00
khugepaged_enter_vma ( res , vm_flags ) ;
2023-01-20 19:26:30 +03:00
return res ;
2023-01-20 19:26:15 +03:00
}
2010-04-10 21:36:19 +04:00
/*
2020-06-05 02:49:04 +03:00
* Rough compatibility check to quickly see if it ' s even worth looking
2010-04-10 21:36:19 +04:00
* at sharing an anon_vma .
*
* They need to have the same vm_file , and the flags can only differ
* in things that mprotect may change .
*
* NOTE ! The fact that we share an anon_vma doesn ' t _have_ to mean that
* we can merge the two vma ' s . For example , we refuse to merge a vma if
* there is a vm_ops - > close ( ) function , because that indicates that the
* driver is doing some kind of reference counting . But that doesn ' t
* really matter for the anon_vma sharing case .
*/
static int anon_vma_compatible ( struct vm_area_struct * a , struct vm_area_struct * b )
{
return a - > vm_end = = b - > vm_start & &
mpol_equal ( vma_policy ( a ) , vma_policy ( b ) ) & &
a - > vm_file = = b - > vm_file & &
2020-04-11 00:33:09 +03:00
! ( ( a - > vm_flags ^ b - > vm_flags ) & ~ ( VM_ACCESS_FLAGS | VM_SOFTDIRTY ) ) & &
2010-04-10 21:36:19 +04:00
b - > vm_pgoff = = a - > vm_pgoff + ( ( b - > vm_start - a - > vm_start ) > > PAGE_SHIFT ) ;
}
/*
* Do some basic sanity checking to see if we can re - use the anon_vma
* from ' old ' . The ' a ' / ' b ' vma ' s are in VM order - one of them will be
* the same as ' old ' , the other will be the new one that is trying
* to share the anon_vma .
*
2022-05-10 04:20:53 +03:00
* NOTE ! This runs with mmap_lock held for reading , so it is possible that
2010-04-10 21:36:19 +04:00
* the anon_vma of ' old ' is concurrently in the process of being set up
* by another page fault trying to merge _that_ . But that ' s ok : if it
* is being set up , that automatically means that it will be a singleton
* acceptable for merging , so we can do all of this optimistically . But
2015-04-16 02:14:08 +03:00
* we do that READ_ONCE ( ) to make sure that we never re - load the pointer .
2010-04-10 21:36:19 +04:00
*
* IOW : that the " list_is_singular() " test on the anon_vma_chain only
* matters for the ' stable anon_vma ' case ( ie the thing we want to avoid
* is to return an anon_vma that is " complex " due to having gone through
* a fork ) .
*
* We also make sure that the two vma ' s are compatible ( adjacent ,
* and with the same memory policies ) . That ' s all stable , even with just
2022-05-10 04:20:53 +03:00
* a read lock on the mmap_lock .
2010-04-10 21:36:19 +04:00
*/
static struct anon_vma * reusable_anon_vma ( struct vm_area_struct * old , struct vm_area_struct * a , struct vm_area_struct * b )
{
if ( anon_vma_compatible ( a , b ) ) {
2015-04-16 02:14:08 +03:00
struct anon_vma * anon_vma = READ_ONCE ( old - > anon_vma ) ;
2010-04-10 21:36:19 +04:00
if ( anon_vma & & list_is_singular ( & old - > anon_vma_chain ) )
return anon_vma ;
}
return NULL ;
}
2005-04-17 02:20:36 +04:00
/*
* find_mergeable_anon_vma is used by anon_vma_prepare , to check
* neighbouring vmas for a suitable anon_vma , before it goes off
* to allocate a new anon_vma . It checks because a repetitive
* sequence of mprotects and faults may otherwise lead to distinct
* anon_vmas being allocated , preventing vma merge in subsequent
* mprotect .
*/
struct anon_vma * find_mergeable_anon_vma ( struct vm_area_struct * vma )
{
2022-09-06 22:49:06 +03:00
MA_STATE ( mas , & vma - > vm_mm - > mm_mt , vma - > vm_end , vma - > vm_end ) ;
2020-01-31 09:14:51 +03:00
struct anon_vma * anon_vma = NULL ;
2022-09-06 22:49:06 +03:00
struct vm_area_struct * prev , * next ;
2020-01-31 09:14:51 +03:00
/* Try next first. */
2022-09-06 22:49:06 +03:00
next = mas_walk ( & mas ) ;
if ( next ) {
anon_vma = reusable_anon_vma ( next , vma , next ) ;
2020-01-31 09:14:51 +03:00
if ( anon_vma )
return anon_vma ;
}
2022-09-06 22:49:06 +03:00
prev = mas_prev ( & mas , 0 ) ;
VM_BUG_ON_VMA ( prev ! = vma , vma ) ;
prev = mas_prev ( & mas , 0 ) ;
2020-01-31 09:14:51 +03:00
/* Try prev next. */
2022-09-06 22:49:06 +03:00
if ( prev )
anon_vma = reusable_anon_vma ( prev , prev , vma ) ;
2020-01-31 09:14:51 +03:00
2005-04-17 02:20:36 +04:00
/*
2020-01-31 09:14:51 +03:00
* We might reach here with anon_vma = = NULL if we can ' t find
* any reusable anon_vma .
2005-04-17 02:20:36 +04:00
* There ' s no absolute need to look only at touching neighbours :
* we could search further afield for " compatible " anon_vmas .
* But it would probably just be a waste of time searching ,
* or lead to too many vmas hanging off the same anon_vma .
* We ' re trying to allow mprotect remerging later on ,
* not trying to minimize memory used for anon_vmas .
*/
2020-01-31 09:14:51 +03:00
return anon_vma ;
2005-04-17 02:20:36 +04:00
}
2012-02-13 07:58:52 +04:00
/*
* If a hint addr is less than mmap_min_addr change hint to be as
* low as possible but still greater than mmap_min_addr
*/
static inline unsigned long round_hint_to_min ( unsigned long hint )
{
hint & = PAGE_MASK ;
if ( ( ( void * ) hint ! = NULL ) & &
( hint < mmap_min_addr ) )
return PAGE_ALIGN ( mmap_min_addr ) ;
return hint ;
}
2023-05-22 23:52:10 +03:00
bool mlock_future_ok ( struct mm_struct * mm , unsigned long flags ,
2023-05-22 11:24:12 +03:00
unsigned long bytes )
2014-01-22 03:49:15 +04:00
{
2023-05-22 11:24:12 +03:00
unsigned long locked_pages , limit_pages ;
2014-01-22 03:49:15 +04:00
2023-05-22 11:24:12 +03:00
if ( ! ( flags & VM_LOCKED ) | | capable ( CAP_IPC_LOCK ) )
return true ;
locked_pages = bytes > > PAGE_SHIFT ;
locked_pages + = mm - > locked_vm ;
limit_pages = rlimit ( RLIMIT_MEMLOCK ) ;
limit_pages > > = PAGE_SHIFT ;
return locked_pages < = limit_pages ;
2014-01-22 03:49:15 +04:00
}
mmap: introduce sane default mmap limits
The internal VM "mmap()" interfaces are based on the mmap target doing
everything using page indexes rather than byte offsets, because
traditionally (ie 32-bit) we had the situation that the byte offset
didn't fit in a register. So while the mmap virtual address was limited
by the word size of the architecture, the backing store was not.
So we're basically passing "pgoff" around as a page index, in order to
be able to describe backing store locations that are much bigger than
the word size (think files larger than 4GB etc).
But while this all makes a ton of sense conceptually, we've been dogged
by various drivers that don't really understand this, and internally
work with byte offsets, and then try to work with the page index by
turning it into a byte offset with "pgoff << PAGE_SHIFT".
Which obviously can overflow.
Adding the size of the mapping to it to get the byte offset of the end
of the backing store just exacerbates the problem, and if you then use
this overflow-prone value to check various limits of your device driver
mmap capability, you're just setting yourself up for problems.
The correct thing for drivers to do is to do their limit math in page
indices, the way the interface is designed. Because the generic mmap
code _does_ test that the index doesn't overflow, since that's what the
mmap code really cares about.
HOWEVER.
Finding and fixing various random drivers is a sisyphean task, so let's
just see if we can just make the core mmap() code do the limiting for
us. Realistically, the only "big" backing stores we need to care about
are regular files and block devices, both of which are known to do this
properly, and which have nice well-defined limits for how much data they
can access.
So let's special-case just those two known cases, and then limit other
random mmap users to a backing store that still fits in "unsigned long".
Realistically, that's not much of a limit at all on 64-bit, and on
32-bit architectures the only worry might be the GPU drivers, which can
have big physical address spaces.
To make it possible for drivers like that to say that they are 64-bit
clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
file flags to allow drivers to mark their file descriptors as safe in
the full 64-bit mmap address space.
[ The timing for doing this is less than optimal, and this should really
go in a merge window. But realistically, this needs wide testing more
than it needs anything else, and being main-line is the only way to do
that.
So the earlier the better, even if it's outside the proper development
cycle - Linus ]
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 19:52:01 +03:00
static inline u64 file_mmap_size_max ( struct file * file , struct inode * inode )
{
if ( S_ISREG ( inode - > i_mode ) )
mmap: relax file size limit for regular files
Commit be83bbf80682 ("mmap: introduce sane default mmap limits") was
introduced to catch problems in various ad-hoc character device drivers
doing mmap and getting the size limits wrong. In the process, it used
"known good" limits for the normal cases of mapping regular files and
block device drivers.
It turns out that the "s_maxbytes" limit was less "known good" than I
thought. In particular, /proc doesn't set it, but exposes one regular
file to mmap: /proc/vmcore. As a result, that file got limited to the
default MAX_INT s_maxbytes value.
This went unnoticed for a while, because apparently the only thing that
needs it is the s390 kernel zfcpdump, but there might be other tools
that use this too.
Vasily suggested just changing s_maxbytes for all of /proc, which isn't
wrong, but makes me nervous at this stage. So instead, just make the
new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
affect anything else. It wasn't the regular file case I was worried
about.
I'd really prefer for maxsize to have been per-inode, but that is not
how things are today.
Fixes: be83bbf80682 ("mmap: introduce sane default mmap limits")
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-19 19:29:11 +03:00
return MAX_LFS_FILESIZE ;
mmap: introduce sane default mmap limits
The internal VM "mmap()" interfaces are based on the mmap target doing
everything using page indexes rather than byte offsets, because
traditionally (ie 32-bit) we had the situation that the byte offset
didn't fit in a register. So while the mmap virtual address was limited
by the word size of the architecture, the backing store was not.
So we're basically passing "pgoff" around as a page index, in order to
be able to describe backing store locations that are much bigger than
the word size (think files larger than 4GB etc).
But while this all makes a ton of sense conceptually, we've been dogged
by various drivers that don't really understand this, and internally
work with byte offsets, and then try to work with the page index by
turning it into a byte offset with "pgoff << PAGE_SHIFT".
Which obviously can overflow.
Adding the size of the mapping to it to get the byte offset of the end
of the backing store just exacerbates the problem, and if you then use
this overflow-prone value to check various limits of your device driver
mmap capability, you're just setting yourself up for problems.
The correct thing for drivers to do is to do their limit math in page
indices, the way the interface is designed. Because the generic mmap
code _does_ test that the index doesn't overflow, since that's what the
mmap code really cares about.
HOWEVER.
Finding and fixing various random drivers is a sisyphean task, so let's
just see if we can just make the core mmap() code do the limiting for
us. Realistically, the only "big" backing stores we need to care about
are regular files and block devices, both of which are known to do this
properly, and which have nice well-defined limits for how much data they
can access.
So let's special-case just those two known cases, and then limit other
random mmap users to a backing store that still fits in "unsigned long".
Realistically, that's not much of a limit at all on 64-bit, and on
32-bit architectures the only worry might be the GPU drivers, which can
have big physical address spaces.
To make it possible for drivers like that to say that they are 64-bit
clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
file flags to allow drivers to mark their file descriptors as safe in
the full 64-bit mmap address space.
[ The timing for doing this is less than optimal, and this should really
go in a merge window. But realistically, this needs wide testing more
than it needs anything else, and being main-line is the only way to do
that.
So the earlier the better, even if it's outside the proper development
cycle - Linus ]
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 19:52:01 +03:00
if ( S_ISBLK ( inode - > i_mode ) )
return MAX_LFS_FILESIZE ;
2019-09-24 01:39:28 +03:00
if ( S_ISSOCK ( inode - > i_mode ) )
return MAX_LFS_FILESIZE ;
mmap: introduce sane default mmap limits
The internal VM "mmap()" interfaces are based on the mmap target doing
everything using page indexes rather than byte offsets, because
traditionally (ie 32-bit) we had the situation that the byte offset
didn't fit in a register. So while the mmap virtual address was limited
by the word size of the architecture, the backing store was not.
So we're basically passing "pgoff" around as a page index, in order to
be able to describe backing store locations that are much bigger than
the word size (think files larger than 4GB etc).
But while this all makes a ton of sense conceptually, we've been dogged
by various drivers that don't really understand this, and internally
work with byte offsets, and then try to work with the page index by
turning it into a byte offset with "pgoff << PAGE_SHIFT".
Which obviously can overflow.
Adding the size of the mapping to it to get the byte offset of the end
of the backing store just exacerbates the problem, and if you then use
this overflow-prone value to check various limits of your device driver
mmap capability, you're just setting yourself up for problems.
The correct thing for drivers to do is to do their limit math in page
indices, the way the interface is designed. Because the generic mmap
code _does_ test that the index doesn't overflow, since that's what the
mmap code really cares about.
HOWEVER.
Finding and fixing various random drivers is a sisyphean task, so let's
just see if we can just make the core mmap() code do the limiting for
us. Realistically, the only "big" backing stores we need to care about
are regular files and block devices, both of which are known to do this
properly, and which have nice well-defined limits for how much data they
can access.
So let's special-case just those two known cases, and then limit other
random mmap users to a backing store that still fits in "unsigned long".
Realistically, that's not much of a limit at all on 64-bit, and on
32-bit architectures the only worry might be the GPU drivers, which can
have big physical address spaces.
To make it possible for drivers like that to say that they are 64-bit
clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
file flags to allow drivers to mark their file descriptors as safe in
the full 64-bit mmap address space.
[ The timing for doing this is less than optimal, and this should really
go in a merge window. But realistically, this needs wide testing more
than it needs anything else, and being main-line is the only way to do
that.
So the earlier the better, even if it's outside the proper development
cycle - Linus ]
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 19:52:01 +03:00
/* Special "we do even unsigned file positions" case */
if ( file - > f_mode & FMODE_UNSIGNED_OFFSET )
return 0 ;
/* Yes, random drivers might want more. But I'm tired of buggy drivers */
return ULONG_MAX ;
}
static inline bool file_mmap_ok ( struct file * file , struct inode * inode ,
unsigned long pgoff , unsigned long len )
{
u64 maxsize = file_mmap_size_max ( file , inode ) ;
if ( maxsize & & len > maxsize )
return false ;
maxsize - = len ;
if ( pgoff > maxsize > > PAGE_SHIFT )
return false ;
return true ;
}
2005-04-17 02:20:36 +04:00
/*
2020-06-09 07:33:51 +03:00
* The caller must write - lock current - > mm - > mmap_lock .
2005-04-17 02:20:36 +04:00
*/
2015-09-10 01:39:29 +03:00
unsigned long do_mmap ( struct file * file , unsigned long addr ,
2005-04-17 02:20:36 +04:00
unsigned long len , unsigned long prot ,
2020-08-07 09:23:37 +03:00
unsigned long flags , unsigned long pgoff ,
unsigned long * populate , struct list_head * uf )
2005-04-17 02:20:36 +04:00
{
2014-10-10 02:26:29 +04:00
struct mm_struct * mm = current - > mm ;
2020-08-07 09:23:37 +03:00
vm_flags_t vm_flags ;
mm/core, x86/mm/pkeys: Add execute-only protection keys support
Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches. That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.
This patch uses protection keys to set up mappings to do just that.
If a user calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory. It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.
I haven't found any userspace that does this today. With this
facility in place, we expect userspace to move to use it
eventually. Userspace _could_ start doing this today. Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code. IOW, userspace never has to do any PROT_EXEC runtime
detection.
This feature provides enhanced protection against leaking
executable memory contents. This helps thwart attacks which are
attempting to find ROP gadgets on the fly.
But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace. An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.
The protection key that is used for execute-only support is
permanently dedicated at compile time. This is fine for now
because there is currently no API to set a protection key other
than this one.
Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process. That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.
PKRU *is* a user register and the kernel is modifying it. That
means that code doing:
pkru = rdpkru()
pkru |= 0x100;
mmap(..., PROT_EXEC);
wrpkru(pkru);
could lose the bits in PKRU that enforce execute-only
permissions. To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-13 00:02:40 +03:00
int pkey = 0 ;
2005-04-17 02:20:36 +04:00
2013-02-23 04:32:47 +04:00
* populate = 0 ;
2013-02-23 04:32:37 +04:00
2015-06-25 02:58:39 +03:00
if ( ! len )
return - EINVAL ;
2005-04-17 02:20:36 +04:00
/*
* Does the application expect PROT_READ to imply PROT_EXEC ?
*
* ( the exception is when the underlying filesystem is noexec
* mounted , in which case we dont add PROT_EXEC . )
*/
if ( ( prot & PROT_READ ) & & ( current - > personality & READ_IMPLIES_EXEC ) )
2015-06-29 22:42:03 +03:00
if ( ! ( file & & path_noexec ( & file - > f_path ) ) )
2005-04-17 02:20:36 +04:00
prot | = PROT_EXEC ;
mm: introduce MAP_FIXED_NOREPLACE
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g. stack).
The reason for the failure is that some architectures enforce an
alignment for the given address hint without MAP_FIXED used (e.g. for
shared or file backed mappings).
One way around this would be excluding those archs which do alignment
tricks from the hardening [6]. The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution. We basically want a non-destructive MAP_FIXED.
The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one. The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility. We really want a never-clobber semantic even on older
kernels which do not recognize the flag. Unfortunately mmap sucks
wrt flags evaluation because we do not EINVAL on unknown flags. On
those kernels we would simply use the traditional hint based semantic so
the caller can still get a different address (which sucks) but at least
not silently corrupt an existing mapping. I do not see a good way
around that. Except we won't export expose the new semantic to the
userspace at all.
It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]
Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints. This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable. We would like to change that and supply a random address in a
: window of the address space. If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.
John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space. "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
: p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
: IMPROVEMENT: --> if instead, we could call this:
:
: p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
: , then we could skip the munmap() call upon failure. This
: is a small thing, but it is useful here. (Thanks to Piotr
: Jaroszynski and Mark Hairgrove for helping me get that detail
: exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
: q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.
Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.
The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
should follow. Actually real MAP_FIXED usages should be docummented
properly and they should be more of an exception.
[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
This patch (of 2):
MAP_FIXED is used quite often to enforce mapping at the particular range.
The main problem of this flag is, however, that it is inherently dangerous
because it unmaps existing mappings covered by the requested range. This
can cause silent memory corruptions. Some of them even with serious
security implications. While the current semantic might be really
desiderable in many cases there are others which would want to enforce the
given range but rather see a failure than a silent memory corruption on a
clashing range. Please note that there is no guarantee that a given range
is obeyed by the mmap even when it is free - e.g. arch specific code is
allowed to apply an alignment.
Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior. It has the same semantic as MAP_FIXED wrt. the given address
request with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping. We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.
The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility. We really want a
never-clobber semantic even on older kernels which do not recognize the
flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
EINVAL on unknown flags. On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping. I do not see a good way around that.
[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 02:35:57 +03:00
/* force arch specific MAP_FIXED handling in get_unmapped_area */
if ( flags & MAP_FIXED_NOREPLACE )
flags | = MAP_FIXED ;
2007-11-27 02:47:40 +03:00
if ( ! ( flags & MAP_FIXED ) )
addr = round_hint_to_min ( addr ) ;
2005-04-17 02:20:36 +04:00
/* Careful about overflows.. */
len = PAGE_ALIGN ( len ) ;
2009-12-03 23:23:11 +03:00
if ( ! len )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
/* offset overflow? */
if ( ( pgoff + ( len > > PAGE_SHIFT ) ) < pgoff )
2014-10-10 02:26:29 +04:00
return - EOVERFLOW ;
2005-04-17 02:20:36 +04:00
/* Too many mappings? */
if ( mm - > map_count > sysctl_max_map_count )
return - ENOMEM ;
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space .
*/
addr = get_unmapped_area ( file , addr , len , pgoff , flags ) ;
2019-12-01 04:51:03 +03:00
if ( IS_ERR_VALUE ( addr ) )
2005-04-17 02:20:36 +04:00
return addr ;
mm: introduce MAP_FIXED_NOREPLACE
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g. stack).
The reason for the failure is that some architectures enforce an
alignment for the given address hint without MAP_FIXED used (e.g. for
shared or file backed mappings).
One way around this would be excluding those archs which do alignment
tricks from the hardening [6]. The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution. We basically want a non-destructive MAP_FIXED.
The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one. The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility. We really want a never-clobber semantic even on older
kernels which do not recognize the flag. Unfortunately mmap sucks
wrt flags evaluation because we do not EINVAL on unknown flags. On
those kernels we would simply use the traditional hint based semantic so
the caller can still get a different address (which sucks) but at least
not silently corrupt an existing mapping. I do not see a good way
around that. Except we won't export expose the new semantic to the
userspace at all.
It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]
Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints. This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable. We would like to change that and supply a random address in a
: window of the address space. If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.
John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space. "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
: p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
: IMPROVEMENT: --> if instead, we could call this:
:
: p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
: , then we could skip the munmap() call upon failure. This
: is a small thing, but it is useful here. (Thanks to Piotr
: Jaroszynski and Mark Hairgrove for helping me get that detail
: exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
: q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.
Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.
The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
should follow. Actually real MAP_FIXED usages should be docummented
properly and they should be more of an exception.
[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
This patch (of 2):
MAP_FIXED is used quite often to enforce mapping at the particular range.
The main problem of this flag is, however, that it is inherently dangerous
because it unmaps existing mappings covered by the requested range. This
can cause silent memory corruptions. Some of them even with serious
security implications. While the current semantic might be really
desiderable in many cases there are others which would want to enforce the
given range but rather see a failure than a silent memory corruption on a
clashing range. Please note that there is no guarantee that a given range
is obeyed by the mmap even when it is free - e.g. arch specific code is
allowed to apply an alignment.
Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior. It has the same semantic as MAP_FIXED wrt. the given address
request with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping. We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.
The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility. We really want a
never-clobber semantic even on older kernels which do not recognize the
flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
EINVAL on unknown flags. On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping. I do not see a good way around that.
[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 02:35:57 +03:00
if ( flags & MAP_FIXED_NOREPLACE ) {
2021-06-29 05:38:44 +03:00
if ( find_vma_intersection ( mm , addr , addr + len ) )
mm: introduce MAP_FIXED_NOREPLACE
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g. stack).
The reason for the failure is that some architectures enforce an
alignment for the given address hint without MAP_FIXED used (e.g. for
shared or file backed mappings).
One way around this would be excluding those archs which do alignment
tricks from the hardening [6]. The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution. We basically want a non-destructive MAP_FIXED.
The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one. The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility. We really want a never-clobber semantic even on older
kernels which do not recognize the flag. Unfortunately mmap sucks
wrt flags evaluation because we do not EINVAL on unknown flags. On
those kernels we would simply use the traditional hint based semantic so
the caller can still get a different address (which sucks) but at least
not silently corrupt an existing mapping. I do not see a good way
around that. Except we won't export expose the new semantic to the
userspace at all.
It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]
Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints. This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable. We would like to change that and supply a random address in a
: window of the address space. If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.
John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space. "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
: p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
: IMPROVEMENT: --> if instead, we could call this:
:
: p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
: , then we could skip the munmap() call upon failure. This
: is a small thing, but it is useful here. (Thanks to Piotr
: Jaroszynski and Mark Hairgrove for helping me get that detail
: exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
: q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.
Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.
The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
should follow. Actually real MAP_FIXED usages should be docummented
properly and they should be more of an exception.
[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
This patch (of 2):
MAP_FIXED is used quite often to enforce mapping at the particular range.
The main problem of this flag is, however, that it is inherently dangerous
because it unmaps existing mappings covered by the requested range. This
can cause silent memory corruptions. Some of them even with serious
security implications. While the current semantic might be really
desiderable in many cases there are others which would want to enforce the
given range but rather see a failure than a silent memory corruption on a
clashing range. Please note that there is no guarantee that a given range
is obeyed by the mmap even when it is free - e.g. arch specific code is
allowed to apply an alignment.
Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior. It has the same semantic as MAP_FIXED wrt. the given address
request with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping. We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.
The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility. We really want a
never-clobber semantic even on older kernels which do not recognize the
flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
EINVAL on unknown flags. On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping. I do not see a good way around that.
[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 02:35:57 +03:00
return - EEXIST ;
}
mm/core, x86/mm/pkeys: Add execute-only protection keys support
Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches. That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.
This patch uses protection keys to set up mappings to do just that.
If a user calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory. It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.
I haven't found any userspace that does this today. With this
facility in place, we expect userspace to move to use it
eventually. Userspace _could_ start doing this today. Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code. IOW, userspace never has to do any PROT_EXEC runtime
detection.
This feature provides enhanced protection against leaking
executable memory contents. This helps thwart attacks which are
attempting to find ROP gadgets on the fly.
But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace. An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.
The protection key that is used for execute-only support is
permanently dedicated at compile time. This is fine for now
because there is currently no API to set a protection key other
than this one.
Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process. That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.
PKRU *is* a user register and the kernel is modifying it. That
means that code doing:
pkru = rdpkru()
pkru |= 0x100;
mmap(..., PROT_EXEC);
wrpkru(pkru);
could lose the bits in PKRU that enforce execute-only
permissions. To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-13 00:02:40 +03:00
if ( prot = = PROT_EXEC ) {
pkey = execute_only_pkey ( mm ) ;
if ( pkey < 0 )
pkey = 0 ;
}
2005-04-17 02:20:36 +04:00
/* Do simple checking here so the lower-level routines won't have
* to . we assume access permissions have been handled by the open
* of the memory object , so we don ' t do any here .
*/
2020-08-07 09:23:37 +03:00
vm_flags = calc_vm_prot_bits ( prot , pkey ) | calc_vm_flag_bits ( flags ) |
2005-04-17 02:20:36 +04:00
mm - > def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC ;
2009-09-22 04:03:36 +04:00
if ( flags & MAP_LOCKED )
2005-04-17 02:20:36 +04:00
if ( ! can_do_mlock ( ) )
return - EPERM ;
2008-10-19 07:26:50 +04:00
2023-05-22 23:52:10 +03:00
if ( ! mlock_future_ok ( mm , vm_flags , len ) )
2014-01-22 03:49:15 +04:00
return - EAGAIN ;
2005-04-17 02:20:36 +04:00
if ( file ) {
2013-09-12 01:20:19 +04:00
struct inode * inode = file_inode ( file ) ;
mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC need a mechanism to
define new behavior that is known to fail on older kernels without the
support. Define a new MAP_SHARED_VALIDATE flag pattern that is
guaranteed to fail on all legacy mmap implementations.
It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:
I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes.
And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.
Boom. Done.
Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_flags exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-01 18:36:30 +03:00
unsigned long flags_mask ;
mmap: introduce sane default mmap limits
The internal VM "mmap()" interfaces are based on the mmap target doing
everything using page indexes rather than byte offsets, because
traditionally (ie 32-bit) we had the situation that the byte offset
didn't fit in a register. So while the mmap virtual address was limited
by the word size of the architecture, the backing store was not.
So we're basically passing "pgoff" around as a page index, in order to
be able to describe backing store locations that are much bigger than
the word size (think files larger than 4GB etc).
But while this all makes a ton of sense conceptually, we've been dogged
by various drivers that don't really understand this, and internally
work with byte offsets, and then try to work with the page index by
turning it into a byte offset with "pgoff << PAGE_SHIFT".
Which obviously can overflow.
Adding the size of the mapping to it to get the byte offset of the end
of the backing store just exacerbates the problem, and if you then use
this overflow-prone value to check various limits of your device driver
mmap capability, you're just setting yourself up for problems.
The correct thing for drivers to do is to do their limit math in page
indices, the way the interface is designed. Because the generic mmap
code _does_ test that the index doesn't overflow, since that's what the
mmap code really cares about.
HOWEVER.
Finding and fixing various random drivers is a sisyphean task, so let's
just see if we can just make the core mmap() code do the limiting for
us. Realistically, the only "big" backing stores we need to care about
are regular files and block devices, both of which are known to do this
properly, and which have nice well-defined limits for how much data they
can access.
So let's special-case just those two known cases, and then limit other
random mmap users to a backing store that still fits in "unsigned long".
Realistically, that's not much of a limit at all on 64-bit, and on
32-bit architectures the only worry might be the GPU drivers, which can
have big physical address spaces.
To make it possible for drivers like that to say that they are 64-bit
clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
file flags to allow drivers to mark their file descriptors as safe in
the full 64-bit mmap address space.
[ The timing for doing this is less than optimal, and this should really
go in a merge window. But realistically, this needs wide testing more
than it needs anything else, and being main-line is the only way to do
that.
So the earlier the better, even if it's outside the proper development
cycle - Linus ]
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 19:52:01 +03:00
if ( ! file_mmap_ok ( file , inode , pgoff , len ) )
return - EOVERFLOW ;
mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC need a mechanism to
define new behavior that is known to fail on older kernels without the
support. Define a new MAP_SHARED_VALIDATE flag pattern that is
guaranteed to fail on all legacy mmap implementations.
It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:
I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes.
And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.
Boom. Done.
Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_flags exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-01 18:36:30 +03:00
flags_mask = LEGACY_MAP_MASK | file - > f_op - > mmap_supported_flags ;
2013-09-12 01:20:19 +04:00
2005-04-17 02:20:36 +04:00
switch ( flags & MAP_TYPE ) {
case MAP_SHARED :
mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC need a mechanism to
define new behavior that is known to fail on older kernels without the
support. Define a new MAP_SHARED_VALIDATE flag pattern that is
guaranteed to fail on all legacy mmap implementations.
It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:
I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes.
And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.
Boom. Done.
Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_flags exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-01 18:36:30 +03:00
/*
* Force use of MAP_SHARED_VALIDATE with non - legacy
* flags . E . g . MAP_SYNC is dangerous to use with
* MAP_SHARED as you don ' t know which consistency model
* you will get . We silently ignore unsupported flags
* with MAP_SHARED to preserve backward compatibility .
*/
flags & = LEGACY_MAP_MASK ;
2020-04-07 06:08:39 +03:00
fallthrough ;
mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC need a mechanism to
define new behavior that is known to fail on older kernels without the
support. Define a new MAP_SHARED_VALIDATE flag pattern that is
guaranteed to fail on all legacy mmap implementations.
It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:
I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes.
And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.
Boom. Done.
Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_flags exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-01 18:36:30 +03:00
case MAP_SHARED_VALIDATE :
if ( flags & ~ flags_mask )
return - EOPNOTSUPP ;
2019-08-20 17:55:16 +03:00
if ( prot & PROT_WRITE ) {
if ( ! ( file - > f_mode & FMODE_WRITE ) )
return - EACCES ;
if ( IS_SWAPFILE ( file - > f_mapping - > host ) )
return - ETXTBSY ;
}
2005-04-17 02:20:36 +04:00
/*
* Make sure we don ' t allow writing to an append - only
* file . .
*/
if ( IS_APPEND ( inode ) & & ( file - > f_mode & FMODE_WRITE ) )
return - EACCES ;
vm_flags | = VM_SHARED | VM_MAYSHARE ;
if ( ! ( file - > f_mode & FMODE_WRITE ) )
vm_flags & = ~ ( VM_MAYWRITE | VM_SHARED ) ;
2020-04-07 06:08:39 +03:00
fallthrough ;
2005-04-17 02:20:36 +04:00
case MAP_PRIVATE :
if ( ! ( file - > f_mode & FMODE_READ ) )
return - EACCES ;
2015-06-29 22:42:03 +03:00
if ( path_noexec ( & file - > f_path ) ) {
2006-10-16 01:09:55 +04:00
if ( vm_flags & VM_EXEC )
return - EPERM ;
vm_flags & = ~ VM_MAYEXEC ;
}
2013-09-23 00:27:52 +04:00
if ( ! file - > f_op - > mmap )
2006-10-16 01:09:55 +04:00
return - ENODEV ;
2013-09-12 01:20:18 +04:00
if ( vm_flags & ( VM_GROWSDOWN | VM_GROWSUP ) )
return - EINVAL ;
2005-04-17 02:20:36 +04:00
break ;
default :
return - EINVAL ;
}
} else {
switch ( flags & MAP_TYPE ) {
case MAP_SHARED :
2013-09-12 01:20:18 +04:00
if ( vm_flags & ( VM_GROWSDOWN | VM_GROWSUP ) )
return - EINVAL ;
2008-09-03 18:09:47 +04:00
/*
* Ignore pgoff .
*/
pgoff = 0 ;
2005-04-17 02:20:36 +04:00
vm_flags | = VM_SHARED | VM_MAYSHARE ;
break ;
case MAP_PRIVATE :
/*
* Set pgoff according to addr for anon_vma .
*/
pgoff = addr > > PAGE_SHIFT ;
break ;
default :
return - EINVAL ;
}
}
2013-02-23 04:32:43 +04:00
/*
* Set ' VM_NORESERVE ' if we should not account for the
* memory use of this mapping .
*/
if ( flags & MAP_NORESERVE ) {
/* We honor MAP_NORESERVE if allowed to overcommit */
if ( sysctl_overcommit_memory ! = OVERCOMMIT_NEVER )
vm_flags | = VM_NORESERVE ;
/* hugetlb applies strict overcommit unless MAP_NORESERVE */
if ( file & & is_file_hugepages ( file ) )
vm_flags | = VM_NORESERVE ;
}
2017-02-25 01:58:22 +03:00
addr = mmap_region ( file , addr , len , vm_flags , pgoff , uf ) ;
2013-03-29 03:26:23 +04:00
if ( ! IS_ERR_VALUE ( addr ) & &
( ( vm_flags & VM_LOCKED ) | |
( flags & ( MAP_POPULATE | MAP_NONBLOCK ) ) = = MAP_POPULATE ) )
2013-02-23 04:32:47 +04:00
* populate = len ;
2013-02-23 04:32:37 +04:00
return addr ;
2007-07-16 10:38:26 +04:00
}
2012-04-21 04:13:58 +04:00
2018-03-11 13:34:46 +03:00
unsigned long ksys_mmap_pgoff ( unsigned long addr , unsigned long len ,
unsigned long prot , unsigned long flags ,
unsigned long fd , unsigned long pgoff )
2009-12-30 23:17:34 +03:00
{
struct file * file = NULL ;
2015-11-06 05:48:35 +03:00
unsigned long retval ;
2009-12-30 23:17:34 +03:00
if ( ! ( flags & MAP_ANONYMOUS ) ) {
2010-10-30 10:54:44 +04:00
audit_mmap_fd ( fd , flags ) ;
2009-12-30 23:17:34 +03:00
file = fget ( fd ) ;
if ( ! file )
2015-11-06 05:48:35 +03:00
return - EBADF ;
mm/mmap: optimize a branch judgment in ksys_mmap_pgoff()
Look at the pseudo code below. It's very clear that, the judgement
"!is_file_hugepages(file)" at 3) is duplicated to the one at 1), we can
use "else if" to avoid it. And the assignment "retval = -EINVAL" at 2) is
only needed by the branch 3), because "retval" will be overwritten at 4).
No functional change, but it can reduce the code size. Maybe more clearer?
Before:
text data bss dec hex filename
28733 1590 1 30324 7674 mm/mmap.o
After:
text data bss dec hex filename
28701 1590 1 30292 7654 mm/mmap.o
====pseudo code====:
if (!(flags & MAP_ANONYMOUS)) {
...
1) if (is_file_hugepages(file))
len = ALIGN(len, huge_page_size(hstate_file(file)));
2) retval = -EINVAL;
3) if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
...
}
...
4) retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
...
return retval;
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 09:22:59 +03:00
if ( is_file_hugepages ( file ) ) {
2013-05-08 03:18:13 +04:00
len = ALIGN ( len , huge_page_size ( hstate_file ( file ) ) ) ;
mm/mmap: optimize a branch judgment in ksys_mmap_pgoff()
Look at the pseudo code below. It's very clear that, the judgement
"!is_file_hugepages(file)" at 3) is duplicated to the one at 1), we can
use "else if" to avoid it. And the assignment "retval = -EINVAL" at 2) is
only needed by the branch 3), because "retval" will be overwritten at 4).
No functional change, but it can reduce the code size. Maybe more clearer?
Before:
text data bss dec hex filename
28733 1590 1 30324 7674 mm/mmap.o
After:
text data bss dec hex filename
28701 1590 1 30292 7654 mm/mmap.o
====pseudo code====:
if (!(flags & MAP_ANONYMOUS)) {
...
1) if (is_file_hugepages(file))
len = ALIGN(len, huge_page_size(hstate_file(file)));
2) retval = -EINVAL;
3) if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
...
}
...
4) retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
...
return retval;
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 09:22:59 +03:00
} else if ( unlikely ( flags & MAP_HUGETLB ) ) {
retval = - EINVAL ;
2013-07-09 03:00:26 +04:00
goto out_fput ;
mm/mmap: optimize a branch judgment in ksys_mmap_pgoff()
Look at the pseudo code below. It's very clear that, the judgement
"!is_file_hugepages(file)" at 3) is duplicated to the one at 1), we can
use "else if" to avoid it. And the assignment "retval = -EINVAL" at 2) is
only needed by the branch 3), because "retval" will be overwritten at 4).
No functional change, but it can reduce the code size. Maybe more clearer?
Before:
text data bss dec hex filename
28733 1590 1 30324 7674 mm/mmap.o
After:
text data bss dec hex filename
28701 1590 1 30292 7654 mm/mmap.o
====pseudo code====:
if (!(flags & MAP_ANONYMOUS)) {
...
1) if (is_file_hugepages(file))
len = ALIGN(len, huge_page_size(hstate_file(file)));
2) retval = -EINVAL;
3) if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
...
}
...
4) retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
...
return retval;
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 09:22:59 +03:00
}
2009-12-30 23:17:34 +03:00
} else if ( flags & MAP_HUGETLB ) {
2013-07-09 03:01:08 +04:00
struct hstate * hs ;
2013-05-08 03:18:13 +04:00
2017-05-04 00:55:00 +03:00
hs = hstate_sizelog ( ( flags > > MAP_HUGE_SHIFT ) & MAP_HUGE_MASK ) ;
2013-05-09 11:08:15 +04:00
if ( ! hs )
return - EINVAL ;
len = ALIGN ( len , huge_page_size ( hs ) ) ;
2009-12-30 23:17:34 +03:00
/*
* VM_NORESERVE is used because the reservations will be
* taken when vm_ops - > mmap ( ) is called
*/
2013-05-08 03:18:13 +04:00
file = hugetlb_file_setup ( HUGETLB_ANON_FILE , len ,
2012-12-12 04:01:34 +04:00
VM_NORESERVE ,
2021-11-09 05:31:27 +03:00
HUGETLB_ANONHUGE_INODE ,
2012-12-12 04:01:34 +04:00
( flags > > MAP_HUGE_SHIFT ) & MAP_HUGE_MASK ) ;
2009-12-30 23:17:34 +03:00
if ( IS_ERR ( file ) )
return PTR_ERR ( file ) ;
}
2016-05-24 02:25:30 +03:00
retval = vm_mmap_pgoff ( file , addr , len , prot , flags , pgoff ) ;
2013-07-09 03:00:26 +04:00
out_fput :
2009-12-30 23:17:34 +03:00
if ( file )
fput ( file ) ;
return retval ;
}
2018-03-11 13:34:46 +03:00
SYSCALL_DEFINE6 ( mmap_pgoff , unsigned long , addr , unsigned long , len ,
unsigned long , prot , unsigned long , flags ,
unsigned long , fd , unsigned long , pgoff )
{
return ksys_mmap_pgoff ( addr , len , prot , flags , fd , pgoff ) ;
}
2010-03-11 02:21:15 +03:00
# ifdef __ARCH_WANT_SYS_OLD_MMAP
struct mmap_arg_struct {
unsigned long addr ;
unsigned long len ;
unsigned long prot ;
unsigned long flags ;
unsigned long fd ;
unsigned long offset ;
} ;
SYSCALL_DEFINE1 ( old_mmap , struct mmap_arg_struct __user * , arg )
{
struct mmap_arg_struct a ;
if ( copy_from_user ( & a , arg , sizeof ( a ) ) )
return - EFAULT ;
2015-11-06 05:46:54 +03:00
if ( offset_in_page ( a . offset ) )
2010-03-11 02:21:15 +03:00
return - EINVAL ;
2018-03-11 13:34:46 +03:00
return ksys_mmap_pgoff ( a . addr , a . len , a . prot , a . flags , a . fd ,
a . offset > > PAGE_SHIFT ) ;
2010-03-11 02:21:15 +03:00
}
# endif /* __ARCH_WANT_SYS_OLD_MMAP */
mm/mmap: separate writenotify and dirty tracking logic
Patch series "mm/gup: disallow GUP writing to file-backed mappings by
default", v9.
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
This patch (of 3):
vma_wants_writenotify() is specifically intended for setting PTE page
table flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.
Everything is predicated firstly on whether the mapping is shared
writable, as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.
All other checks are in line with existing logic, though now separated
into checks eplicitily for dirty tracking and those for determining how to
set page table flags.
We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
Link: https://lkml.kernel.org/r/cover.1683235180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-05 00:27:51 +03:00
static bool vm_ops_needs_writenotify ( const struct vm_operations_struct * vm_ops )
{
return vm_ops & & ( vm_ops - > page_mkwrite | | vm_ops - > pfn_mkwrite ) ;
}
static bool vma_is_shared_writable ( struct vm_area_struct * vma )
{
return ( vma - > vm_flags & ( VM_WRITE | VM_SHARED ) ) = =
( VM_WRITE | VM_SHARED ) ;
}
static bool vma_fs_can_writeback ( struct vm_area_struct * vma )
{
/* No managed pages to writeback. */
if ( vma - > vm_flags & VM_PFNMAP )
return false ;
return vma - > vm_file & & vma - > vm_file - > f_mapping & &
mapping_can_writeback ( vma - > vm_file - > f_mapping ) ;
}
/*
* Does this VMA require the underlying folios to have their dirty state
* tracked ?
*/
bool vma_needs_dirty_tracking ( struct vm_area_struct * vma )
{
/* Only shared, writable VMAs require dirty tracking. */
if ( ! vma_is_shared_writable ( vma ) )
return false ;
/* Does the filesystem need to be notified? */
if ( vm_ops_needs_writenotify ( vma - > vm_ops ) )
return true ;
/*
* Even if the filesystem doesn ' t indicate a need for writenotify , if it
* can writeback , dirty tracking is still required .
*/
return vma_fs_can_writeback ( vma ) ;
}
2007-07-30 02:36:13 +04:00
/*
2019-03-06 02:46:22 +03:00
* Some shared mappings will want the pages marked read - only
2007-07-30 02:36:13 +04:00
* to track write events . If so , we ' ll downgrade vm_page_prot
* to the private version ( using protection_map [ ] without the
* VM_SHARED bit ) .
*/
2016-10-08 03:01:22 +03:00
int vma_wants_writenotify ( struct vm_area_struct * vma , pgprot_t vm_page_prot )
2007-07-30 02:36:13 +04:00
{
/* If it was private or non-writable, the write bit is already clear */
mm/mmap: separate writenotify and dirty tracking logic
Patch series "mm/gup: disallow GUP writing to file-backed mappings by
default", v9.
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
This patch (of 3):
vma_wants_writenotify() is specifically intended for setting PTE page
table flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.
Everything is predicated firstly on whether the mapping is shared
writable, as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.
All other checks are in line with existing logic, though now separated
into checks eplicitily for dirty tracking and those for determining how to
set page table flags.
We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
Link: https://lkml.kernel.org/r/cover.1683235180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-05 00:27:51 +03:00
if ( ! vma_is_shared_writable ( vma ) )
2007-07-30 02:36:13 +04:00
return 0 ;
/* The backer wishes to know when pages are first written to? */
mm/mmap: separate writenotify and dirty tracking logic
Patch series "mm/gup: disallow GUP writing to file-backed mappings by
default", v9.
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
This patch (of 3):
vma_wants_writenotify() is specifically intended for setting PTE page
table flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.
Everything is predicated firstly on whether the mapping is shared
writable, as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.
All other checks are in line with existing logic, though now separated
into checks eplicitily for dirty tracking and those for determining how to
set page table flags.
We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
Link: https://lkml.kernel.org/r/cover.1683235180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-05 00:27:51 +03:00
if ( vm_ops_needs_writenotify ( vma - > vm_ops ) )
2007-07-30 02:36:13 +04:00
return 1 ;
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:55:46 +04:00
/* The open routine did something to the protections that pgprot_modify
* won ' t preserve ? */
2016-10-08 03:01:22 +03:00
if ( pgprot_val ( vm_page_prot ) ! =
mm/mmap: separate writenotify and dirty tracking logic
Patch series "mm/gup: disallow GUP writing to file-backed mappings by
default", v9.
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
This patch (of 3):
vma_wants_writenotify() is specifically intended for setting PTE page
table flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.
Everything is predicated firstly on whether the mapping is shared
writable, as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.
All other checks are in line with existing logic, though now separated
into checks eplicitily for dirty tracking and those for determining how to
set page table flags.
We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
Link: https://lkml.kernel.org/r/cover.1683235180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-05 00:27:51 +03:00
pgprot_val ( vm_pgprot_modify ( vm_page_prot , vma - > vm_flags ) ) )
2007-07-30 02:36:13 +04:00
return 0 ;
mm/hugetlb: fix hugetlb not supporting softdirty tracking
Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.
I observed that hugetlb does not support/expect write-faults in shared
mappings that would have to map the R/O-mapped page writable -- and I
found two case where we could currently get such faults and would
erroneously map an anon page into a shared mapping.
Reproducers part of the patches.
I propose to backport both fixes to stable trees. The first fix needs a
small adjustment.
This patch (of 2):
Staring at hugetlb_wp(), one might wonder where all the logic for shared
mappings is when stumbling over a write-protected page in a shared
mapping. In fact, there is none, and so far we thought we could get away
with that because e.g., mprotect() should always do the right thing and
map all pages directly writable.
Looks like we were wrong:
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#define HUGETLB_SIZE (2 * 1024 * 1024u)
static void clear_softdirty(void)
{
int fd = open("/proc/self/clear_refs", O_WRONLY);
const char *ctrl = "4";
int ret;
if (fd < 0) {
fprintf(stderr, "open(clear_refs) failed\n");
exit(1);
}
ret = write(fd, ctrl, strlen(ctrl));
if (ret != strlen(ctrl)) {
fprintf(stderr, "write(clear_refs) failed\n");
exit(1);
}
close(fd);
}
int main(int argc, char **argv)
{
char *map;
int fd;
fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT);
if (!fd) {
fprintf(stderr, "open() failed\n");
return -errno;
}
if (ftruncate(fd, HUGETLB_SIZE)) {
fprintf(stderr, "ftruncate() failed\n");
return -errno;
}
map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "mmap() failed\n");
return -errno;
}
*map = 0;
if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
fprintf(stderr, "mmprotect() failed\n");
return -errno;
}
clear_softdirty();
if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
fprintf(stderr, "mmprotect() failed\n");
return -errno;
}
*map = 0;
return 0;
}
--------------------------------------------------------------------------
Above test fails with SIGBUS when there is only a single free hugetlb page.
# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page
into a shared mapping, for example, messing up accounting during unmap
and breaking MAP_SHARED semantics:
# echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
# cat /proc/meminfo | grep HugePages_
HugePages_Total: 2
HugePages_Free: 1
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Reason in this particular case is that vma_wants_writenotify() will
return "true", removing VM_SHARED in vma_set_page_prot() to map pages
write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
support softdirty tracking.
Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-11 13:34:34 +03:00
/*
* Do we need to track softdirty ? hugetlb does not support softdirty
* tracking yet .
*/
if ( vma_soft_dirty_enabled ( vma ) & & ! is_vm_hugetlb_page ( vma ) )
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:55:46 +04:00
return 1 ;
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 11:09:12 +03:00
/* Do we need write faults for uffd-wp tracking? */
if ( userfaultfd_wp ( vma ) )
return 1 ;
2007-07-30 02:36:13 +04:00
/* Can the mapping track the dirty pages? */
mm/mmap: separate writenotify and dirty tracking logic
Patch series "mm/gup: disallow GUP writing to file-backed mappings by
default", v9.
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
This patch (of 3):
vma_wants_writenotify() is specifically intended for setting PTE page
table flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.
Everything is predicated firstly on whether the mapping is shared
writable, as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.
All other checks are in line with existing logic, though now separated
into checks eplicitily for dirty tracking and those for determining how to
set page table flags.
We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
Link: https://lkml.kernel.org/r/cover.1683235180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-05 00:27:51 +03:00
return vma_fs_can_writeback ( vma ) ;
2007-07-30 02:36:13 +04:00
}
2009-02-01 02:08:56 +03:00
/*
* We account for memory if it ' s a private writeable mapping ,
2009-02-10 17:02:27 +03:00
* not hugepages and VM_NORESERVE wasn ' t set .
2009-02-01 02:08:56 +03:00
*/
2011-05-26 14:16:19 +04:00
static inline int accountable_mapping ( struct file * file , vm_flags_t vm_flags )
2009-02-01 02:08:56 +03:00
{
2009-02-10 17:02:27 +03:00
/*
* hugetlb has its own accounting separate from the core VM
* VM_HUGETLB may not be set yet so we cannot check for that flag .
*/
if ( file & & is_file_hugepages ( file ) )
return 0 ;
2009-02-01 02:08:56 +03:00
return ( vm_flags & ( VM_NORESERVE | VM_SHARED | VM_WRITE ) ) = = VM_WRITE ;
}
2022-09-06 22:48:47 +03:00
/**
* unmapped_area ( ) - Find an area between the low_limit and the high_limit with
* the correct alignment and offset , all from @ info . Note : current - > mm is used
* for the search .
*
2023-01-11 16:20:36 +03:00
* @ info : The unmapped area information including the range [ low_limit -
* high_limit ) , the alignment offset and mask .
2022-09-06 22:48:47 +03:00
*
* Return : A memory address or - ENOMEM .
*/
2020-04-02 07:09:10 +03:00
static unsigned long unmapped_area ( struct vm_unmapped_area_info * info )
2012-12-12 04:01:49 +04:00
{
2023-04-19 00:40:09 +03:00
unsigned long length , gap ;
unsigned long low_limit , high_limit ;
2023-04-14 21:59:19 +03:00
struct vm_area_struct * tmp ;
2012-12-12 04:01:49 +04:00
2022-09-06 22:48:47 +03:00
MA_STATE ( mas , & current - > mm - > mm_mt , 0 , 0 ) ;
2012-12-12 04:01:49 +04:00
/* Adjust search length to account for worst case alignment overhead */
length = info - > length + info - > align_mask ;
if ( length < info - > length )
return - ENOMEM ;
2023-04-14 21:59:19 +03:00
low_limit = info - > low_limit ;
2023-04-19 00:40:09 +03:00
if ( low_limit < mmap_min_addr )
low_limit = mmap_min_addr ;
high_limit = info - > high_limit ;
2023-04-14 21:59:19 +03:00
retry :
2023-04-19 00:40:09 +03:00
if ( mas_empty_area ( & mas , low_limit , high_limit - 1 , length ) )
2012-12-12 04:01:49 +04:00
return - ENOMEM ;
2022-09-06 22:48:47 +03:00
gap = mas . index ;
gap + = ( info - > align_offset - gap ) & info - > align_mask ;
2023-04-14 21:59:19 +03:00
tmp = mas_next ( & mas , ULONG_MAX ) ;
if ( tmp & & ( tmp - > vm_flags & VM_GROWSDOWN ) ) { /* Avoid prev check if possible */
if ( vm_start_gap ( tmp ) < gap + length - 1 ) {
low_limit = tmp - > vm_end ;
mas_reset ( & mas ) ;
goto retry ;
}
} else {
tmp = mas_prev ( & mas , 0 ) ;
if ( tmp & & vm_end_gap ( tmp ) > gap ) {
low_limit = vm_end_gap ( tmp ) ;
mas_reset ( & mas ) ;
goto retry ;
}
}
2022-09-06 22:48:47 +03:00
return gap ;
2012-12-12 04:01:49 +04:00
}
2022-09-06 22:48:47 +03:00
/**
* unmapped_area_topdown ( ) - Find an area between the low_limit and the
2023-01-11 16:20:36 +03:00
* high_limit with the correct alignment and offset at the highest available
2022-09-06 22:48:47 +03:00
* address , all from @ info . Note : current - > mm is used for the search .
*
2023-01-11 16:20:36 +03:00
* @ info : The unmapped area information including the range [ low_limit -
* high_limit ) , the alignment offset and mask .
2022-09-06 22:48:47 +03:00
*
* Return : A memory address or - ENOMEM .
*/
2020-04-02 07:09:10 +03:00
static unsigned long unmapped_area_topdown ( struct vm_unmapped_area_info * info )
2012-12-12 04:01:49 +04:00
{
2023-04-19 00:40:09 +03:00
unsigned long length , gap , gap_end ;
unsigned long low_limit , high_limit ;
2023-04-14 21:59:19 +03:00
struct vm_area_struct * tmp ;
2012-12-12 04:01:49 +04:00
2022-09-06 22:48:47 +03:00
MA_STATE ( mas , & current - > mm - > mm_mt , 0 , 0 ) ;
2012-12-12 04:01:49 +04:00
/* Adjust search length to account for worst case alignment overhead */
length = info - > length + info - > align_mask ;
if ( length < info - > length )
return - ENOMEM ;
2023-04-19 00:40:09 +03:00
low_limit = info - > low_limit ;
if ( low_limit < mmap_min_addr )
low_limit = mmap_min_addr ;
2023-04-14 21:59:19 +03:00
high_limit = info - > high_limit ;
retry :
2023-04-19 00:40:09 +03:00
if ( mas_empty_area_rev ( & mas , low_limit , high_limit - 1 , length ) )
2012-12-12 04:01:49 +04:00
return - ENOMEM ;
2022-09-06 22:48:47 +03:00
gap = mas . last + 1 - info - > length ;
gap - = ( gap - info - > align_offset ) & info - > align_mask ;
2023-04-14 21:59:19 +03:00
gap_end = mas . last ;
tmp = mas_next ( & mas , ULONG_MAX ) ;
if ( tmp & & ( tmp - > vm_flags & VM_GROWSDOWN ) ) { /* Avoid prev check if possible */
if ( vm_start_gap ( tmp ) < = gap_end ) {
high_limit = vm_start_gap ( tmp ) ;
mas_reset ( & mas ) ;
goto retry ;
}
} else {
tmp = mas_prev ( & mas , 0 ) ;
if ( tmp & & vm_end_gap ( tmp ) > gap ) {
high_limit = tmp - > vm_start ;
mas_reset ( & mas ) ;
goto retry ;
}
}
2022-09-06 22:48:47 +03:00
return gap ;
2012-12-12 04:01:49 +04:00
}
2020-04-02 07:09:10 +03:00
/*
* Search for an unmapped address range .
*
* We are looking for a range that :
* - does not intersect with any VMA ;
* - is contained within the [ low_limit , high_limit ) interval ;
* - is at least the desired size .
* - satisfies ( begin_addr & align_mask ) = = ( align_offset & align_mask )
*/
unsigned long vm_unmapped_area ( struct vm_unmapped_area_info * info )
{
2020-04-02 07:09:13 +03:00
unsigned long addr ;
2020-04-02 07:09:10 +03:00
if ( info - > flags & VM_UNMAPPED_AREA_TOPDOWN )
2020-04-02 07:09:13 +03:00
addr = unmapped_area_topdown ( info ) ;
2020-04-02 07:09:10 +03:00
else
2020-04-02 07:09:13 +03:00
addr = unmapped_area ( info ) ;
trace_vm_unmapped_area ( addr , info ) ;
return addr ;
2020-04-02 07:09:10 +03:00
}
2018-12-07 01:50:36 +03:00
2005-04-17 02:20:36 +04:00
/* Get an address range which is currently unmapped.
* For shmat ( ) with addr = 0.
*
* Ugly calling convention alert :
* Return value with the low bits set means error value ,
* ie
* if ( ret & ~ PAGE_MASK )
* error = ret ;
*
* This function " knows " that - ENOMEM has the bits set .
*/
unsigned long
2022-04-09 20:17:27 +03:00
generic_get_unmapped_area ( struct file * filp , unsigned long addr ,
unsigned long len , unsigned long pgoff ,
unsigned long flags )
2005-04-17 02:20:36 +04:00
{
struct mm_struct * mm = current - > mm ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
struct vm_area_struct * vma , * prev ;
2012-12-12 04:01:49 +04:00
struct vm_unmapped_area_info info ;
2022-04-09 20:17:28 +03:00
const unsigned long mmap_end = arch_get_mmap_end ( addr , len , flags ) ;
2005-04-17 02:20:36 +04:00
2018-12-07 01:50:36 +03:00
if ( len > mmap_end - mmap_min_addr )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
2007-05-07 01:50:13 +04:00
if ( flags & MAP_FIXED )
return addr ;
2005-04-17 02:20:36 +04:00
if ( addr ) {
addr = PAGE_ALIGN ( addr ) ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
vma = find_vma_prev ( mm , addr , & prev ) ;
2018-12-07 01:50:36 +03:00
if ( mmap_end - len > = addr & & addr > = mmap_min_addr & &
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
( ! vma | | addr + len < = vm_start_gap ( vma ) ) & &
( ! prev | | addr > = vm_end_gap ( prev ) ) )
2005-04-17 02:20:36 +04:00
return addr ;
}
2012-12-12 04:01:49 +04:00
info . flags = 0 ;
info . length = len ;
2013-11-13 03:07:54 +04:00
info . low_limit = mm - > mmap_base ;
2018-12-07 01:50:36 +03:00
info . high_limit = mmap_end ;
2012-12-12 04:01:49 +04:00
info . align_mask = 0 ;
2020-04-11 00:32:48 +03:00
info . align_offset = 0 ;
2012-12-12 04:01:49 +04:00
return vm_unmapped_area ( & info ) ;
2005-04-17 02:20:36 +04:00
}
2022-04-09 20:17:27 +03:00
# ifndef HAVE_ARCH_UNMAPPED_AREA
unsigned long
arch_get_unmapped_area ( struct file * filp , unsigned long addr ,
unsigned long len , unsigned long pgoff ,
unsigned long flags )
{
return generic_get_unmapped_area ( filp , addr , len , pgoff , flags ) ;
}
2014-10-10 02:26:29 +04:00
# endif
2005-04-17 02:20:36 +04:00
/*
* This mmap - allocator allocates new areas top - down from below the
* stack ' s low limit ( the base ) :
*/
unsigned long
2022-04-09 20:17:27 +03:00
generic_get_unmapped_area_topdown ( struct file * filp , unsigned long addr ,
unsigned long len , unsigned long pgoff ,
unsigned long flags )
2005-04-17 02:20:36 +04:00
{
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
struct vm_area_struct * vma , * prev ;
2005-04-17 02:20:36 +04:00
struct mm_struct * mm = current - > mm ;
2012-12-12 04:01:49 +04:00
struct vm_unmapped_area_info info ;
2022-04-09 20:17:28 +03:00
const unsigned long mmap_end = arch_get_mmap_end ( addr , len , flags ) ;
2005-04-17 02:20:36 +04:00
/* requested length too big for entire address space */
2018-12-07 01:50:36 +03:00
if ( len > mmap_end - mmap_min_addr )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
2007-05-07 01:50:13 +04:00
if ( flags & MAP_FIXED )
return addr ;
2005-04-17 02:20:36 +04:00
/* requesting a specific address */
if ( addr ) {
addr = PAGE_ALIGN ( addr ) ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
vma = find_vma_prev ( mm , addr , & prev ) ;
2018-12-07 01:50:36 +03:00
if ( mmap_end - len > = addr & & addr > = mmap_min_addr & &
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
( ! vma | | addr + len < = vm_start_gap ( vma ) ) & &
( ! prev | | addr > = vm_end_gap ( prev ) ) )
2005-04-17 02:20:36 +04:00
return addr ;
}
2012-12-12 04:01:49 +04:00
info . flags = VM_UNMAPPED_AREA_TOPDOWN ;
info . length = len ;
2023-04-19 00:40:09 +03:00
info . low_limit = PAGE_SIZE ;
2018-12-07 01:50:36 +03:00
info . high_limit = arch_get_mmap_base ( addr , mm - > mmap_base ) ;
2012-12-12 04:01:49 +04:00
info . align_mask = 0 ;
2020-04-11 00:32:48 +03:00
info . align_offset = 0 ;
2012-12-12 04:01:49 +04:00
addr = vm_unmapped_area ( & info ) ;
2012-03-22 03:33:56 +04:00
2005-04-17 02:20:36 +04:00
/*
* A failed mmap ( ) very likely causes application failure ,
* so fall back to the bottom - up function here . This scenario
* can happen with large stack limits and large mmap ( )
* allocations .
*/
2015-11-06 05:46:54 +03:00
if ( offset_in_page ( addr ) ) {
2012-12-12 04:01:49 +04:00
VM_BUG_ON ( addr ! = - ENOMEM ) ;
info . flags = 0 ;
info . low_limit = TASK_UNMAPPED_BASE ;
2018-12-07 01:50:36 +03:00
info . high_limit = mmap_end ;
2012-12-12 04:01:49 +04:00
addr = vm_unmapped_area ( & info ) ;
}
2005-04-17 02:20:36 +04:00
return addr ;
}
2022-04-09 20:17:27 +03:00
# ifndef HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
unsigned long
arch_get_unmapped_area_topdown ( struct file * filp , unsigned long addr ,
unsigned long len , unsigned long pgoff ,
unsigned long flags )
{
return generic_get_unmapped_area_topdown ( filp , addr , len , pgoff , flags ) ;
}
2005-04-17 02:20:36 +04:00
# endif
unsigned long
get_unmapped_area ( struct file * file , unsigned long addr , unsigned long len ,
unsigned long pgoff , unsigned long flags )
{
2007-05-07 01:50:13 +04:00
unsigned long ( * get_area ) ( struct file * , unsigned long ,
unsigned long , unsigned long , unsigned long ) ;
2009-12-03 23:23:11 +03:00
unsigned long error = arch_mmap_check ( addr , len , flags ) ;
if ( error )
return error ;
/* Careful about overflows.. */
if ( len > TASK_SIZE )
return - ENOMEM ;
2007-05-07 01:50:13 +04:00
get_area = current - > mm - > get_unmapped_area ;
2016-07-27 01:26:15 +03:00
if ( file ) {
if ( file - > f_op - > get_unmapped_area )
get_area = file - > f_op - > get_unmapped_area ;
} else if ( flags & MAP_SHARED ) {
/*
* mmap_region ( ) will call shmem_zero_setup ( ) to create a file ,
* so use shmem ' s get_unmapped_area in case it can be huge .
2020-08-07 09:23:37 +03:00
* do_mmap ( ) will clear pgoff , so match alignment .
2016-07-27 01:26:15 +03:00
*/
pgoff = 0 ;
get_area = shmem_get_unmapped_area ;
}
2007-05-07 01:50:13 +04:00
addr = get_area ( file , addr , len , pgoff , flags ) ;
if ( IS_ERR_VALUE ( addr ) )
return addr ;
2005-04-17 02:20:36 +04:00
2005-05-20 09:43:37 +04:00
if ( addr > TASK_SIZE - len )
return - ENOMEM ;
2015-11-06 05:46:54 +03:00
if ( offset_in_page ( addr ) )
2005-05-20 09:43:37 +04:00
return - EINVAL ;
2007-05-07 01:50:13 +04:00
2012-05-31 01:13:15 +04:00
error = security_mmap_addr ( addr ) ;
return error ? error : addr ;
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL ( get_unmapped_area ) ;
2022-09-06 22:48:50 +03:00
/**
* find_vma_intersection ( ) - Look up the first VMA which intersects the interval
* @ mm : The process address space .
* @ start_addr : The inclusive start user address .
* @ end_addr : The exclusive end user address .
*
* Returns : The first VMA within the provided range , % NULL otherwise . Assumes
* start_addr < end_addr .
*/
struct vm_area_struct * find_vma_intersection ( struct mm_struct * mm ,
unsigned long start_addr ,
unsigned long end_addr )
{
unsigned long index = start_addr ;
mmap_assert_locked ( mm ) ;
2022-09-06 22:48:51 +03:00
return mt_find ( & mm - > mm_mt , & index , end_addr - 1 ) ;
2022-09-06 22:48:50 +03:00
}
EXPORT_SYMBOL ( find_vma_intersection ) ;
2022-09-06 22:48:46 +03:00
/**
* find_vma ( ) - Find the VMA for a given address , or the next VMA .
* @ mm : The mm_struct to check
* @ addr : The address
*
* Returns : The VMA associated with addr , or the next VMA .
* May return % NULL in the case of no VMA at addr or above .
*/
2009-01-07 01:40:21 +03:00
struct vm_area_struct * find_vma ( struct mm_struct * mm , unsigned long addr )
2005-04-17 02:20:36 +04:00
{
2022-09-06 22:48:46 +03:00
unsigned long index = addr ;
2005-04-17 02:20:36 +04:00
2021-09-03 00:56:46 +03:00
mmap_assert_locked ( mm ) ;
2022-09-06 22:48:51 +03:00
return mt_find ( & mm - > mm_mt , & index , ULONG_MAX ) ;
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL ( find_vma ) ;
2022-09-06 22:48:47 +03:00
/**
* find_vma_prev ( ) - Find the VMA for a given address , or the next vma and
* set % pprev to the previous VMA , if any .
* @ mm : The mm_struct to check
* @ addr : The address
* @ pprev : The pointer to set to the previous VMA
*
* Note that RCU lock is missing here since the external mmap_lock ( ) is used
* instead .
*
* Returns : The VMA associated with @ addr , or the next vma .
* May return % NULL in the case of no vma at addr or above .
2012-01-11 03:08:07 +04:00
*/
2005-04-17 02:20:36 +04:00
struct vm_area_struct *
find_vma_prev ( struct mm_struct * mm , unsigned long addr ,
struct vm_area_struct * * pprev )
{
2012-01-11 03:08:07 +04:00
struct vm_area_struct * vma ;
2022-09-06 22:48:47 +03:00
MA_STATE ( mas , & mm - > mm_mt , addr , addr ) ;
2005-04-17 02:20:36 +04:00
2022-09-06 22:48:47 +03:00
vma = mas_walk ( & mas ) ;
* pprev = mas_prev ( & mas , 0 ) ;
if ( ! vma )
vma = mas_next ( & mas , ULONG_MAX ) ;
2012-01-11 03:08:07 +04:00
return vma ;
2005-04-17 02:20:36 +04:00
}
/*
* Verify that the stack growth is acceptable and
* update accounting . This is shared with both the
* grow - up and grow - down cases .
*/
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
static int acct_stack_growth ( struct vm_area_struct * vma ,
unsigned long size , unsigned long grow )
2005-04-17 02:20:36 +04:00
{
struct mm_struct * mm = vma - > vm_mm ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
unsigned long new_start ;
2005-04-17 02:20:36 +04:00
/* address space limit tests */
2016-01-15 02:22:07 +03:00
if ( ! may_expand_vm ( mm , vma - > vm_flags , grow ) )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
/* Stack limit test */
2017-07-11 01:50:03 +03:00
if ( size > rlimit ( RLIMIT_STACK ) )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
/* mlock limit tests */
2023-05-22 23:52:10 +03:00
if ( ! mlock_future_ok ( mm , vma - > vm_flags , grow < < PAGE_SHIFT ) )
2022-04-29 09:16:12 +03:00
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
2007-01-31 01:35:39 +03:00
/* Check to ensure the stack will not grow into a hugetlb-only region */
new_start = ( vma - > vm_flags & VM_GROWSUP ) ? vma - > vm_start :
vma - > vm_end - size ;
if ( is_hugepage_only_range ( vma - > vm_mm , new_start , size ) )
return - EFAULT ;
2005-04-17 02:20:36 +04:00
/*
* Overcommit . . This must be the final test , as it will
* update security statistics .
*/
2009-04-17 00:58:12 +04:00
if ( security_vm_enough_memory_mm ( mm , grow ) )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
return 0 ;
}
2005-10-30 04:16:20 +03:00
# if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
2005-04-17 02:20:36 +04:00
/*
2005-10-30 04:16:20 +03:00
* PA - RISC uses this for its stack ; IA64 for its Register Backing Store .
* vma is the last one with address > vma - > vm_end . Have to extend vma .
2005-04-17 02:20:36 +04:00
*/
2023-06-24 23:45:51 +03:00
static int expand_upwards ( struct vm_area_struct * vma , unsigned long address )
2005-04-17 02:20:36 +04:00
{
2015-11-06 05:48:17 +03:00
struct mm_struct * mm = vma - > vm_mm ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
struct vm_area_struct * next ;
unsigned long gap_addr ;
2016-02-06 02:36:50 +03:00
int error = 0 ;
2022-09-06 22:48:45 +03:00
MA_STATE ( mas , & mm - > mm_mt , 0 , 0 ) ;
2005-04-17 02:20:36 +04:00
if ( ! ( vma - > vm_flags & VM_GROWSUP ) )
return - EFAULT ;
2017-06-19 18:34:05 +03:00
/* Guard against exceeding limits of the address space. */
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
address & = PAGE_MASK ;
2017-07-15 00:49:38 +03:00
if ( address > = ( TASK_SIZE & PAGE_MASK ) )
2016-02-06 02:36:50 +03:00
return - ENOMEM ;
2017-06-19 18:34:05 +03:00
address + = PAGE_SIZE ;
2016-02-06 02:36:50 +03:00
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
/* Enforce stack_guard_gap */
gap_addr = address + stack_guard_gap ;
2017-06-19 18:34:05 +03:00
/* Guard against overflow */
if ( gap_addr < address | | gap_addr > TASK_SIZE )
gap_addr = TASK_SIZE ;
2022-09-06 22:49:06 +03:00
next = find_vma_intersection ( mm , vma - > vm_end , gap_addr ) ;
if ( next & & vma_is_accessible ( next ) ) {
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
if ( ! ( next - > vm_flags & VM_GROWSUP ) )
return - ENOMEM ;
/* Check that both stack segments have the same anon_vma? */
}
2023-01-10 18:42:11 +03:00
if ( mas_preallocate ( & mas , GFP_KERNEL ) )
2022-09-06 22:48:45 +03:00
return - ENOMEM ;
2016-02-06 02:36:50 +03:00
/* We must make sure the anon_vma is allocated. */
2022-09-06 22:48:45 +03:00
if ( unlikely ( anon_vma_prepare ( vma ) ) ) {
mas_destroy ( & mas ) ;
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
2022-09-06 22:48:45 +03:00
}
2005-04-17 02:20:36 +04:00
2023-07-08 22:12:10 +03:00
/* Lock the VMA before expanding to prevent concurrent page faults */
vma_start_write ( vma ) ;
2005-04-17 02:20:36 +04:00
/*
* vma - > vm_start / vm_end cannot change under us because the caller
2020-06-09 07:33:54 +03:00
* is required to hold the mmap_lock in read mode . We need the
2005-04-17 02:20:36 +04:00
* anon_vma lock to serialize against concurrent expand_stacks .
*/
2016-02-06 02:36:50 +03:00
anon_vma_lock_write ( vma - > anon_vma ) ;
2005-04-17 02:20:36 +04:00
/* Somebody else might have raced and expanded it already */
if ( address > vma - > vm_end ) {
unsigned long size , grow ;
size = address - vma - > vm_start ;
grow = ( address - vma - > vm_end ) > > PAGE_SHIFT ;
2011-05-10 04:44:42 +04:00
error = - ENOMEM ;
if ( vma - > vm_pgoff + ( size > > PAGE_SHIFT ) > = vma - > vm_pgoff ) {
error = acct_stack_growth ( vma , size , grow ) ;
if ( ! error ) {
2012-12-13 01:52:25 +04:00
/*
2022-09-06 22:48:48 +03:00
* We only hold a shared mmap_lock lock here , so
* we need to protect against concurrent vma
* expansions . anon_vma_lock_write ( ) doesn ' t
* help here , as we don ' t guarantee that all
* growable vmas in a mm share the same root
* anon vma . So , we reuse mm - > page_table_lock
* to guard against concurrent vma expansions .
2012-12-13 01:52:25 +04:00
*/
2015-11-06 05:48:17 +03:00
spin_lock ( & mm - > page_table_lock ) ;
2015-11-06 05:48:14 +03:00
if ( vma - > vm_flags & VM_LOCKED )
2015-11-06 05:48:17 +03:00
mm - > locked_vm + = grow ;
2016-01-15 02:22:07 +03:00
vm_stat_account ( mm , vma - > vm_flags , grow ) ;
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
anon_vma_interval_tree_pre_update_vma ( vma ) ;
2011-05-10 04:44:42 +04:00
vma - > vm_end = address ;
2022-09-06 22:48:45 +03:00
/* Overwrite old entry in mtree. */
2023-01-20 19:26:32 +03:00
mas_set_range ( & mas , vma - > vm_start , address - 1 ) ;
mas_store_prealloc ( & mas , vma ) ;
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
anon_vma_interval_tree_post_update_vma ( vma ) ;
2015-11-06 05:48:17 +03:00
spin_unlock ( & mm - > page_table_lock ) ;
2012-12-13 01:52:25 +04:00
2011-05-10 04:44:42 +04:00
perf_event_mmap ( vma ) ;
}
2010-05-18 18:30:49 +04:00
}
2005-04-17 02:20:36 +04:00
}
2016-02-06 02:36:50 +03:00
anon_vma_unlock_write ( vma - > anon_vma ) ;
2022-05-20 00:08:50 +03:00
khugepaged_enter_vma ( vma , vma - > vm_flags ) ;
2022-09-06 22:48:45 +03:00
mas_destroy ( & mas ) ;
2023-07-14 22:55:48 +03:00
validate_mm ( mm ) ;
2005-04-17 02:20:36 +04:00
return error ;
}
2005-10-30 04:16:20 +03:00
# endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
2005-04-17 02:20:36 +04:00
/*
* vma is the first one with address < vma - > vm_start . Have to extend vma .
2023-06-24 23:45:51 +03:00
* mmap_lock held for writing .
2005-04-17 02:20:36 +04:00
*/
2022-09-06 22:48:48 +03:00
int expand_downwards ( struct vm_area_struct * vma , unsigned long address )
2005-04-17 02:20:36 +04:00
{
2015-11-06 05:48:17 +03:00
struct mm_struct * mm = vma - > vm_mm ;
2022-09-06 22:49:06 +03:00
MA_STATE ( mas , & mm - > mm_mt , vma - > vm_start , vma - > vm_start ) ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
struct vm_area_struct * prev ;
2019-02-27 23:29:52 +03:00
int error = 0 ;
2005-04-17 02:20:36 +04:00
2023-06-24 23:45:51 +03:00
if ( ! ( vma - > vm_flags & VM_GROWSDOWN ) )
return - EFAULT ;
2007-11-27 02:47:26 +03:00
address & = PAGE_MASK ;
2023-06-22 22:24:30 +03:00
if ( address < mmap_min_addr | | address < FIRST_USER_ADDRESS )
2019-02-27 23:29:52 +03:00
return - EPERM ;
2007-11-27 02:47:26 +03:00
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
/* Enforce stack_guard_gap */
2022-09-06 22:49:06 +03:00
prev = mas_prev ( & mas , 0 ) ;
2017-07-11 01:49:54 +03:00
/* Check that both stack segments have the same anon_vma? */
2023-06-17 01:58:54 +03:00
if ( prev ) {
if ( ! ( prev - > vm_flags & VM_GROWSDOWN ) & &
vma_is_accessible ( prev ) & &
( address - prev - > vm_end < stack_guard_gap ) )
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
return - ENOMEM ;
}
2023-01-10 18:42:11 +03:00
if ( mas_preallocate ( & mas , GFP_KERNEL ) )
2022-09-06 22:48:45 +03:00
return - ENOMEM ;
2016-02-06 02:36:50 +03:00
/* We must make sure the anon_vma is allocated. */
2022-09-06 22:48:45 +03:00
if ( unlikely ( anon_vma_prepare ( vma ) ) ) {
mas_destroy ( & mas ) ;
2016-02-06 02:36:50 +03:00
return - ENOMEM ;
2022-09-06 22:48:45 +03:00
}
2005-04-17 02:20:36 +04:00
2023-07-08 22:12:10 +03:00
/* Lock the VMA before expanding to prevent concurrent page faults */
vma_start_write ( vma ) ;
2005-04-17 02:20:36 +04:00
/*
* vma - > vm_start / vm_end cannot change under us because the caller
2020-06-09 07:33:54 +03:00
* is required to hold the mmap_lock in read mode . We need the
2005-04-17 02:20:36 +04:00
* anon_vma lock to serialize against concurrent expand_stacks .
*/
2016-02-06 02:36:50 +03:00
anon_vma_lock_write ( vma - > anon_vma ) ;
2005-04-17 02:20:36 +04:00
/* Somebody else might have raced and expanded it already */
if ( address < vma - > vm_start ) {
unsigned long size , grow ;
size = vma - > vm_end - address ;
grow = ( vma - > vm_start - address ) > > PAGE_SHIFT ;
2011-04-13 19:07:28 +04:00
error = - ENOMEM ;
if ( grow < = vma - > vm_pgoff ) {
error = acct_stack_growth ( vma , size , grow ) ;
if ( ! error ) {
2012-12-13 01:52:25 +04:00
/*
2022-09-06 22:48:48 +03:00
* We only hold a shared mmap_lock lock here , so
* we need to protect against concurrent vma
* expansions . anon_vma_lock_write ( ) doesn ' t
* help here , as we don ' t guarantee that all
* growable vmas in a mm share the same root
* anon vma . So , we reuse mm - > page_table_lock
* to guard against concurrent vma expansions .
2012-12-13 01:52:25 +04:00
*/
2015-11-06 05:48:17 +03:00
spin_lock ( & mm - > page_table_lock ) ;
2015-11-06 05:48:14 +03:00
if ( vma - > vm_flags & VM_LOCKED )
2015-11-06 05:48:17 +03:00
mm - > locked_vm + = grow ;
2016-01-15 02:22:07 +03:00
vm_stat_account ( mm , vma - > vm_flags , grow ) ;
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
anon_vma_interval_tree_pre_update_vma ( vma ) ;
2011-04-13 19:07:28 +04:00
vma - > vm_start = address ;
vma - > vm_pgoff - = grow ;
2022-09-06 22:48:45 +03:00
/* Overwrite old entry in mtree. */
2023-01-20 19:26:32 +03:00
mas_set_range ( & mas , address , vma - > vm_end - 1 ) ;
mas_store_prealloc ( & mas , vma ) ;
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
anon_vma_interval_tree_post_update_vma ( vma ) ;
2015-11-06 05:48:17 +03:00
spin_unlock ( & mm - > page_table_lock ) ;
2012-12-13 01:52:25 +04:00
2011-04-13 19:07:28 +04:00
perf_event_mmap ( vma ) ;
}
2005-04-17 02:20:36 +04:00
}
}
2016-02-06 02:36:50 +03:00
anon_vma_unlock_write ( vma - > anon_vma ) ;
2022-05-20 00:08:50 +03:00
khugepaged_enter_vma ( vma , vma - > vm_flags ) ;
2022-09-06 22:48:45 +03:00
mas_destroy ( & mas ) ;
2023-07-14 22:55:48 +03:00
validate_mm ( mm ) ;
2005-04-17 02:20:36 +04:00
return error ;
}
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
/* enforced gap between the expanding stack and other mappings. */
unsigned long stack_guard_gap = 256UL < < PAGE_SHIFT ;
static int __init cmdline_parse_stack_guard_gap ( char * p )
{
unsigned long val ;
char * endptr ;
val = simple_strtoul ( p , & endptr , 10 ) ;
if ( ! * endptr )
stack_guard_gap = val < < PAGE_SHIFT ;
2022-03-23 00:42:27 +03:00
return 1 ;
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 14:03:24 +03:00
}
__setup ( " stack_guard_gap= " , cmdline_parse_stack_guard_gap ) ;
2007-07-19 12:48:16 +04:00
# ifdef CONFIG_STACK_GROWSUP
2023-06-24 23:45:51 +03:00
int expand_stack_locked ( struct vm_area_struct * vma , unsigned long address )
2007-07-19 12:48:16 +04:00
{
return expand_upwards ( vma , address ) ;
}
2023-06-24 23:45:51 +03:00
struct vm_area_struct * find_extend_vma_locked ( struct mm_struct * mm , unsigned long addr )
2007-07-19 12:48:16 +04:00
{
struct vm_area_struct * vma , * prev ;
addr & = PAGE_MASK ;
vma = find_vma_prev ( mm , addr , & prev ) ;
if ( vma & & ( vma - > vm_start < = addr ) )
return vma ;
2023-06-17 01:58:54 +03:00
if ( ! prev )
return NULL ;
2023-06-24 23:45:51 +03:00
if ( expand_stack_locked ( prev , addr ) )
2007-07-19 12:48:16 +04:00
return NULL ;
2013-02-23 04:32:44 +04:00
if ( prev - > vm_flags & VM_LOCKED )
2015-04-15 01:44:39 +03:00
populate_vma_page_range ( prev , addr , prev - > vm_end , NULL ) ;
2007-07-19 12:48:16 +04:00
return prev ;
}
# else
2023-06-24 23:45:51 +03:00
int expand_stack_locked ( struct vm_area_struct * vma , unsigned long address )
2007-07-19 12:48:16 +04:00
{
2023-06-17 01:58:54 +03:00
if ( unlikely ( ! ( vma - > vm_flags & VM_GROWSDOWN ) ) )
return - EINVAL ;
2007-07-19 12:48:16 +04:00
return expand_downwards ( vma , address ) ;
}
2023-06-24 23:45:51 +03:00
struct vm_area_struct * find_extend_vma_locked ( struct mm_struct * mm , unsigned long addr )
2005-04-17 02:20:36 +04:00
{
2014-10-10 02:26:29 +04:00
struct vm_area_struct * vma ;
2005-04-17 02:20:36 +04:00
unsigned long start ;
addr & = PAGE_MASK ;
2014-10-10 02:26:29 +04:00
vma = find_vma ( mm , addr ) ;
2005-04-17 02:20:36 +04:00
if ( ! vma )
return NULL ;
if ( vma - > vm_start < = addr )
return vma ;
start = vma - > vm_start ;
2023-06-24 23:45:51 +03:00
if ( expand_stack_locked ( vma , addr ) )
2005-04-17 02:20:36 +04:00
return NULL ;
2013-02-23 04:32:44 +04:00
if ( vma - > vm_flags & VM_LOCKED )
2015-04-15 01:44:39 +03:00
populate_vma_page_range ( vma , addr , start , NULL ) ;
2005-04-17 02:20:36 +04:00
return vma ;
}
# endif
2023-06-24 23:45:51 +03:00
/*
* IA64 has some horrid mapping rules : it can expand both up and down ,
* but with various special rules .
*
* We ' ll get rid of this architecture eventually , so the ugliness is
* temporary .
*/
# ifdef CONFIG_IA64
static inline bool vma_expand_ok ( struct vm_area_struct * vma , unsigned long addr )
{
return REGION_NUMBER ( addr ) = = REGION_NUMBER ( vma - > vm_start ) & &
REGION_OFFSET ( addr ) < RGN_MAP_LIMIT ;
}
/*
* IA64 stacks grow down , but there ' s a special register backing store
* that can grow up . Only sequentially , though , so the new address must
* match vm_end .
*/
static inline int vma_expand_up ( struct vm_area_struct * vma , unsigned long addr )
{
if ( ! vma_expand_ok ( vma , addr ) )
return - EFAULT ;
if ( vma - > vm_end ! = ( addr & PAGE_MASK ) )
return - EFAULT ;
return expand_upwards ( vma , addr ) ;
}
static inline bool vma_expand_down ( struct vm_area_struct * vma , unsigned long addr )
{
if ( ! vma_expand_ok ( vma , addr ) )
return - EFAULT ;
return expand_downwards ( vma , addr ) ;
}
# elif defined(CONFIG_STACK_GROWSUP)
# define vma_expand_up(vma,addr) expand_upwards(vma, addr)
# define vma_expand_down(vma, addr) (-EFAULT)
# else
# define vma_expand_up(vma,addr) (-EFAULT)
# define vma_expand_down(vma, addr) expand_downwards(vma, addr)
# endif
/*
* expand_stack ( ) : legacy interface for page faulting . Don ' t use unless
* you have to .
*
* This is called with the mm locked for reading , drops the lock , takes
* the lock for writing , tries to look up a vma again , expands it if
* necessary , and downgrades the lock to reading again .
*
* If no vma is found or it can ' t be expanded , it returns NULL and has
* dropped the lock .
*/
struct vm_area_struct * expand_stack ( struct mm_struct * mm , unsigned long addr )
2023-06-17 01:58:54 +03:00
{
2023-06-24 23:45:51 +03:00
struct vm_area_struct * vma , * prev ;
mmap_read_unlock ( mm ) ;
if ( mmap_write_lock_killable ( mm ) )
return NULL ;
vma = find_vma_prev ( mm , addr , & prev ) ;
if ( vma & & vma - > vm_start < = addr )
goto success ;
if ( prev & & ! vma_expand_up ( prev , addr ) ) {
vma = prev ;
goto success ;
}
if ( vma & & ! vma_expand_down ( vma , addr ) )
goto success ;
mmap_write_unlock ( mm ) ;
return NULL ;
success :
mmap_write_downgrade ( mm ) ;
return vma ;
2023-06-17 01:58:54 +03:00
}
2014-12-13 03:55:27 +03:00
2005-04-17 02:20:36 +04:00
/*
2022-09-06 22:49:06 +03:00
* Ok - we have the memory areas we should free on a maple tree so release them ,
* and do the vma updates .
2005-10-30 04:15:56 +03:00
*
* Called with the mm semaphore held .
2005-04-17 02:20:36 +04:00
*/
2022-09-06 22:49:06 +03:00
static inline void remove_mt ( struct mm_struct * mm , struct ma_state * mas )
2005-04-17 02:20:36 +04:00
{
2012-05-07 00:54:06 +04:00
unsigned long nr_accounted = 0 ;
2022-09-06 22:49:06 +03:00
struct vm_area_struct * vma ;
2012-05-07 00:54:06 +04:00
[PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 04:16:18 +03:00
/* Update high watermark before we lower total_vm */
update_hiwater_vm ( mm ) ;
2022-09-06 22:49:06 +03:00
mas_for_each ( mas , vma , ULONG_MAX ) {
2005-10-30 04:15:56 +03:00
long nrpages = vma_pages ( vma ) ;
2012-05-07 00:54:06 +04:00
if ( vma - > vm_flags & VM_ACCOUNT )
nr_accounted + = nrpages ;
2016-01-15 02:22:07 +03:00
vm_stat_account ( mm , vma - > vm_flags , - nrpages ) ;
2023-02-27 20:36:31 +03:00
remove_vma ( vma , false ) ;
2022-09-06 22:49:06 +03:00
}
2012-05-07 00:54:06 +04:00
vm_unacct_memory ( nr_accounted ) ;
2005-04-17 02:20:36 +04:00
}
/*
* Get rid of page table information in the indicated region .
*
2005-09-21 20:55:37 +04:00
* Called with the mm semaphore held .
2005-04-17 02:20:36 +04:00
*/
2022-09-06 22:49:06 +03:00
static void unmap_region ( struct mm_struct * mm , struct maple_tree * mt ,
[PATCH] freepgt: free_pgtables use vma list
Recent woes with some arches needing their own pgd_addr_end macro; and 4-level
clear_page_range regression since 2.6.10's clear_page_tables; and its
long-standing well-known inefficiency in searching throughout the higher-level
page tables for those few entries to clear and free: all can be blamed on
ignoring the list of vmas when we free page tables.
Replace exit_mmap's clear_page_range of the total user address space by
free_pgtables operating on the mm's vma list; unmap_region use it in the same
way, giving floor and ceiling beyond which it may not free tables. This
brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled,
in which case latency fixes spoil unmap_vmas throughput).
Beware: the do_mmap_pgoff driver failure case must now use unmap_region
instead of zap_page_range, since a page table might have been allocated, and
can only be freed while it is touched by some vma.
Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
from the clear_page_range levels. (Most of free_pgtables' old code was
actually for a non-existent case, prev not properly set up, dating from before
hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we
might want to add latency lockdrops later; but no attempt to do so yet, going
by vma should itself reduce latency.
But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful
examination: put that off until a later patch of the series.
What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma?
And the range to sparc64's flush_tlb_pgtables? It's less clear to me now that
we need to do more than is done here - every PMD_SIZE ever occupied will be
flushed, do we really have to flush every PGDIR_SIZE ever partially occupied?
A shame to complicate it unnecessarily.
Special thanks to David Miller for time spent repairing my ceilings.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-20 00:29:15 +04:00
struct vm_area_struct * vma , struct vm_area_struct * prev ,
2022-09-06 22:49:06 +03:00
struct vm_area_struct * next ,
2023-01-26 22:37:51 +03:00
unsigned long start , unsigned long end , bool mm_wr_locked )
2005-04-17 02:20:36 +04:00
{
2011-05-25 04:11:45 +04:00
struct mmu_gather tlb ;
2005-04-17 02:20:36 +04:00
lru_add_drain ( ) ;
2021-01-28 02:53:45 +03:00
tlb_gather_mmu ( & tlb , mm ) ;
[PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 04:16:18 +03:00
update_hiwater_rss ( mm ) ;
2023-01-26 22:37:51 +03:00
unmap_vmas ( & tlb , mt , vma , start , end , mm_wr_locked ) ;
2022-09-06 22:49:06 +03:00
free_pgtables ( & tlb , mt , vma , prev ? prev - > vm_end : FIRST_USER_ADDRESS ,
2023-02-27 20:36:18 +03:00
next ? next - > vm_start : USER_PGTABLES_CEILING ,
mm_wr_locked ) ;
2021-01-28 02:53:43 +03:00
tlb_finish_mmu ( & tlb ) ;
2005-04-17 02:20:36 +04:00
}
/*
2017-02-25 01:58:47 +03:00
* __split_vma ( ) bypasses sysctl_max_map_count checking . We use this where it
* has already been checked or doesn ' t make sense to fail .
2023-01-20 19:26:36 +03:00
* VMA Iterator will point to the end VMA .
2005-04-17 02:20:36 +04:00
*/
2023-01-20 19:26:30 +03:00
int __split_vma ( struct vma_iterator * vmi , struct vm_area_struct * vma ,
2017-02-25 01:58:47 +03:00
unsigned long addr , int new_below )
2005-04-17 02:20:36 +04:00
{
2023-01-20 19:26:44 +03:00
struct vma_prepare vp ;
2005-04-17 02:20:36 +04:00
struct vm_area_struct * new ;
2015-09-09 01:03:38 +03:00
int err ;
2023-01-20 19:26:30 +03:00
2023-01-20 19:26:44 +03:00
WARN_ON ( vma - > vm_start > = addr ) ;
WARN_ON ( vma - > vm_end < = addr ) ;
2020-12-15 06:08:17 +03:00
if ( vma - > vm_ops & & vma - > vm_ops - > may_split ) {
err = vma - > vm_ops - > may_split ( vma , addr ) ;
2017-11-30 03:10:28 +03:00
if ( err )
return err ;
}
2005-04-17 02:20:36 +04:00
2018-07-21 23:48:51 +03:00
new = vm_area_dup ( vma ) ;
2005-04-17 02:20:36 +04:00
if ( ! new )
2015-09-09 01:03:38 +03:00
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
2023-01-20 19:26:44 +03:00
err = - ENOMEM ;
if ( vma_iter_prealloc ( vmi ) )
goto out_free_vma ;
if ( new_below ) {
2005-04-17 02:20:36 +04:00
new - > vm_end = addr ;
2023-01-20 19:26:44 +03:00
} else {
2005-04-17 02:20:36 +04:00
new - > vm_start = addr ;
new - > vm_pgoff + = ( ( addr - vma - > vm_start ) > > PAGE_SHIFT ) ;
}
2013-09-12 01:20:14 +04:00
err = vma_dup_policy ( vma , new ) ;
if ( err )
2023-01-20 19:26:44 +03:00
goto out_free_vmi ;
2005-04-17 02:20:36 +04:00
2014-12-03 02:59:42 +03:00
err = anon_vma_clone ( new , vma ) ;
if ( err )
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
goto out_free_mpol ;
2012-10-09 03:28:54 +04:00
if ( new - > vm_file )
2005-04-17 02:20:36 +04:00
get_file ( new - > vm_file ) ;
if ( new - > vm_ops & & new - > vm_ops - > open )
new - > vm_ops - > open ( new ) ;
2023-01-20 19:26:44 +03:00
init_vma_prep ( & vp , vma ) ;
vp . insert = new ;
vma_prepare ( & vp ) ;
2023-02-27 20:36:13 +03:00
vma_adjust_trans_huge ( vma , vma - > vm_start , addr , 0 ) ;
2005-04-17 02:20:36 +04:00
2023-01-20 19:26:44 +03:00
if ( new_below ) {
vma - > vm_start = addr ;
vma - > vm_pgoff + = ( addr - new - > vm_start ) > > PAGE_SHIFT ;
} else {
vma - > vm_end = addr ;
2023-01-20 19:26:30 +03:00
}
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
2023-01-20 19:26:44 +03:00
/* vma_complete stores the new vma */
vma_complete ( & vp , vmi , vma - > vm_mm ) ;
/* Success. */
if ( new_below )
vma_next ( vmi ) ;
return 0 ;
out_free_mpol :
2013-09-12 01:20:14 +04:00
mpol_put ( vma_policy ( new ) ) ;
2023-01-20 19:26:44 +03:00
out_free_vmi :
vma_iter_free ( vmi ) ;
out_free_vma :
2018-07-21 23:48:51 +03:00
vm_area_free ( new ) ;
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
return err ;
2005-04-17 02:20:36 +04:00
}
mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()
On ia64, the following test program exit abnormally, because glibc thread
library called abort().
========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================
The fact is, glibc call munmap() when thread exitng time for freeing
stack, and it assume munlock() never fail. However, munmap() often make
vma splitting and it with many mapcount make -ENOMEM.
Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.
This patch does it.
test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>
#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)
void *wait_thread(void *args)
{
void *addr;
addr = malloc(MAL_SIZE);
sleep(10);
return NULL;
}
void *wait_thread2(void *args)
{
sleep(60);
return NULL;
}
int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;
ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}
ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}
for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}
sleep(3600);
return 0;
}
==================================================================
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 04:57:56 +03:00
/*
* Split a vma into two pieces at address ' addr ' , a new vma is allocated
* either for the first part or the tail .
*/
2023-01-20 19:26:30 +03:00
int split_vma ( struct vma_iterator * vmi , struct vm_area_struct * vma ,
mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()
On ia64, the following test program exit abnormally, because glibc thread
library called abort().
========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================
The fact is, glibc call munmap() when thread exitng time for freeing
stack, and it assume munlock() never fail. However, munmap() often make
vma splitting and it with many mapcount make -ENOMEM.
Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.
This patch does it.
test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>
#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)
void *wait_thread(void *args)
{
void *addr;
addr = malloc(MAL_SIZE);
sleep(10);
return NULL;
}
void *wait_thread2(void *args)
{
sleep(60);
return NULL;
}
int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;
ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}
ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}
for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}
sleep(3600);
return 0;
}
==================================================================
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 04:57:56 +03:00
unsigned long addr , int new_below )
{
2023-01-20 19:26:30 +03:00
if ( vma - > vm_mm - > map_count > = sysctl_max_map_count )
mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()
On ia64, the following test program exit abnormally, because glibc thread
library called abort().
========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================
The fact is, glibc call munmap() when thread exitng time for freeing
stack, and it assume munlock() never fail. However, munmap() often make
vma splitting and it with many mapcount make -ENOMEM.
Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.
This patch does it.
test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>
#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)
void *wait_thread(void *args)
{
void *addr;
addr = malloc(MAL_SIZE);
sleep(10);
return NULL;
}
void *wait_thread2(void *args)
{
sleep(60);
return NULL;
}
int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;
ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}
ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}
for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}
sleep(3600);
return 0;
}
==================================================================
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 04:57:56 +03:00
return - ENOMEM ;
2023-01-20 19:26:30 +03:00
return __split_vma ( vmi , vma , addr , new_below ) ;
2023-01-20 19:26:15 +03:00
}
2022-09-06 22:48:52 +03:00
/*
2023-01-20 19:26:13 +03:00
* do_vmi_align_munmap ( ) - munmap the aligned region from @ start to @ end .
* @ vmi : The vma iterator
2022-09-06 22:48:52 +03:00
* @ vma : The starting vm_area_struct
* @ mm : The mm_struct
* @ start : The aligned start address to munmap .
* @ end : The aligned end address to munmap .
* @ uf : The userfaultfd list_head
2023-06-30 05:28:16 +03:00
* @ unlock : Set to true to drop the mmap_lock . unlocking only happens on
* success .
2022-09-06 22:48:52 +03:00
*
2023-06-30 05:28:16 +03:00
* Return : 0 on success and drops the lock if so directed , error and leaves the
* lock held otherwise .
2022-09-06 22:48:52 +03:00
*/
static int
2023-01-20 19:26:13 +03:00
do_vmi_align_munmap ( struct vma_iterator * vmi , struct vm_area_struct * vma ,
2022-09-06 22:48:52 +03:00
struct mm_struct * mm , unsigned long start ,
2023-06-30 05:28:16 +03:00
unsigned long end , struct list_head * uf , bool unlock )
2022-09-06 22:48:52 +03:00
{
2022-09-06 22:49:06 +03:00
struct vm_area_struct * prev , * next = NULL ;
struct maple_tree mt_detach ;
int count = 0 ;
2022-09-06 22:48:52 +03:00
int error = - ENOMEM ;
2023-06-18 03:47:08 +03:00
unsigned long locked_vm = 0 ;
2022-09-06 22:49:06 +03:00
MA_STATE ( mas_detach , & mt_detach , 0 , 0 ) ;
2023-02-27 20:36:07 +03:00
mt_init_flags ( & mt_detach , vmi - > mas . tree - > ma_flags & MT_FLAGS_LOCK_MASK ) ;
2022-09-06 22:49:06 +03:00
mt_set_external_lock ( & mt_detach , & mm - > mmap_lock ) ;
2022-09-06 22:48:45 +03:00
2005-04-17 02:20:36 +04:00
/*
* If we need to split any vma , do it now to save pain later .
*
* Note : mremap ' s move_vma VM_ACCOUNT handling assumes a partially
* unmapped vm_area_struct will remain in use : so lower split_vma
* places tmp vma above , and higher split_vma places tmp vma below .
*/
2022-09-06 22:49:06 +03:00
/* Does it split the first one? */
2005-04-20 00:29:18 +04:00
if ( start > vma - > vm_start ) {
mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()
On ia64, the following test program exit abnormally, because glibc thread
library called abort().
========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================
The fact is, glibc call munmap() when thread exitng time for freeing
stack, and it assume munlock() never fail. However, munmap() often make
vma splitting and it with many mapcount make -ENOMEM.
Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.
This patch does it.
test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>
#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)
void *wait_thread(void *args)
{
void *addr;
addr = malloc(MAL_SIZE);
sleep(10);
return NULL;
}
void *wait_thread2(void *args)
{
sleep(60);
return NULL;
}
int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;
ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}
ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}
for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}
sleep(3600);
return 0;
}
==================================================================
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 04:57:56 +03:00
/*
* Make sure that map_count on return from munmap ( ) will
* not exceed its limit ; but let map_count go just above
* its limit temporarily , to help free resources as expected .
*/
if ( end < vma - > vm_end & & mm - > map_count > = sysctl_max_map_count )
2022-09-06 22:48:45 +03:00
goto map_count_exceeded ;
mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()
On ia64, the following test program exit abnormally, because glibc thread
library called abort().
========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================
The fact is, glibc call munmap() when thread exitng time for freeing
stack, and it assume munlock() never fail. However, munmap() often make
vma splitting and it with many mapcount make -ENOMEM.
Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary. internal temporary exceeding
shouldn't make ENOMEM.
This patch does it.
test_max_mapcount.c
==================================================================
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
#include<errno.h>
#include<unistd.h>
#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)
void *wait_thread(void *args)
{
void *addr;
addr = malloc(MAL_SIZE);
sleep(10);
return NULL;
}
void *wait_thread2(void *args)
{
sleep(60);
return NULL;
}
int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;
ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}
ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}
for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}
sleep(3600);
return 0;
}
==================================================================
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 04:57:56 +03:00
2023-01-20 19:26:30 +03:00
error = __split_vma ( vmi , vma , start , 0 ) ;
2005-04-17 02:20:36 +04:00
if ( error )
2022-09-06 22:49:06 +03:00
goto start_split_failed ;
2022-09-06 22:48:52 +03:00
2023-01-20 19:26:36 +03:00
vma = vma_iter_load ( vmi ) ;
2005-04-17 02:20:36 +04:00
}
2023-01-20 19:26:13 +03:00
prev = vma_prev ( vmi ) ;
2022-09-06 22:49:06 +03:00
if ( unlikely ( ( ! prev ) ) )
2023-01-20 19:26:13 +03:00
vma_iter_set ( vmi , start ) ;
2022-09-06 22:49:06 +03:00
/*
* Detach a range of VMAs from the mm . Using next as a temp variable as
* it is always overwritten .
*/
2023-01-20 19:26:13 +03:00
for_each_vma_range ( * vmi , next , end ) {
2022-09-06 22:49:06 +03:00
/* Does it split the end? */
if ( next - > vm_end > end ) {
2023-01-20 19:26:39 +03:00
error = __split_vma ( vmi , next , end , 0 ) ;
2022-09-06 22:49:06 +03:00
if ( error )
goto end_split_failed ;
}
2023-06-18 03:47:08 +03:00
vma_start_write ( next ) ;
mas_set_range ( & mas_detach , next - > vm_start , next - > vm_end - 1 ) ;
2023-06-28 12:55:03 +03:00
error = mas_store_gfp ( & mas_detach , next , GFP_KERNEL ) ;
if ( error )
2023-06-18 03:47:08 +03:00
goto munmap_gather_failed ;
vma_mark_detached ( next , true ) ;
if ( next - > vm_flags & VM_LOCKED )
locked_vm + = vma_pages ( next ) ;
2022-09-06 22:48:52 +03:00
2022-09-06 22:49:06 +03:00
count + + ;
2023-06-01 04:54:02 +03:00
if ( unlikely ( uf ) ) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
* will remain split , but userland will get a
* highly unexpected error anyway . This is no
* different than the case where the first of the two
* __split_vma fails , but we don ' t undo the first
* split , despite we could . This is unlikely enough
* failure that it ' s not worth optimizing it for .
*/
error = userfaultfd_unmap_prep ( next , start , end , uf ) ;
if ( error )
goto userfaultfd_error ;
}
2022-09-06 22:49:06 +03:00
# ifdef CONFIG_DEBUG_VM_MAPLE_TREE
BUG_ON ( next - > vm_start < start ) ;
BUG_ON ( next - > vm_start > end ) ;
# endif
2005-04-17 02:20:36 +04:00
}
2023-05-18 17:55:31 +03:00
if ( vma_iter_end ( vmi ) > end )
next = vma_iter_load ( vmi ) ;
2022-09-06 22:48:52 +03:00
2023-05-18 17:55:31 +03:00
if ( ! next )
next = vma_next ( vmi ) ;
2017-09-07 02:23:53 +03:00
2022-09-06 22:49:06 +03:00
# if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
{
MA_STATE ( test , & mt_detach , start , end - 1 ) ;
struct vm_area_struct * vma_mas , * vma_test ;
int test_count = 0 ;
2023-01-20 19:26:13 +03:00
vma_iter_set ( vmi , start ) ;
2022-09-06 22:49:06 +03:00
rcu_read_lock ( ) ;
vma_test = mas_find ( & test , end - 1 ) ;
2023-01-20 19:26:13 +03:00
for_each_vma_range ( * vmi , vma_mas , end ) {
2022-09-06 22:49:06 +03:00
BUG_ON ( vma_mas ! = vma_test ) ;
test_count + + ;
vma_test = mas_next ( & test , end - 1 ) ;
}
rcu_read_unlock ( ) ;
BUG_ON ( count ! = test_count ) ;
}
# endif
2023-01-20 19:26:13 +03:00
vma_iter_set ( vmi , start ) ;
2023-06-28 12:55:03 +03:00
error = vma_iter_clear_gfp ( vmi , start , end , GFP_KERNEL ) ;
if ( error )
2023-06-18 03:47:08 +03:00
goto clear_tree_failed ;
2023-01-20 19:26:12 +03:00
2023-06-28 12:55:03 +03:00
/* Point of no return */
2023-06-18 03:47:08 +03:00
mm - > locked_vm - = locked_vm ;
2022-09-06 22:49:06 +03:00
mm - > map_count - = count ;
2023-06-30 05:28:16 +03:00
if ( unlock )
2023-06-29 22:14:14 +03:00
mmap_write_downgrade ( mm ) ;
mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.
Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.
History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great
feedback and suggestions.
Then this topic was discussed on LSFMM summit 2018. In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach. Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)
Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
* The unexpected state from PF if it wins the race in the middle of munmap.
It may return zero page, instead of the content or SIGSEGV.
* Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
is a showstopper from akpm
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer. So, uprobe unmap will not be handled by the regular
path.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
This patch (of 3):
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle. When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).
Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to
cleanup vmas.
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.
munmap_test-15002 [008] 594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us
| unmap_region();
munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us
| }
Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.
[1] https://lwn.net/Articles/753269/
Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:07:11 +03:00
2023-01-26 22:37:51 +03:00
/*
* We can free page tables without write - locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock .
*/
2023-06-30 05:28:16 +03:00
unmap_region ( mm , & mt_detach , vma , prev , next , start , end , ! unlock ) ;
2022-09-06 22:49:06 +03:00
/* Statistics and freeing VMAs */
mas_set ( & mas_detach , start ) ;
remove_mt ( mm , & mas_detach ) ;
__mt_destroy ( & mt_detach ) ;
2023-07-03 20:08:50 +03:00
validate_mm ( mm ) ;
2023-06-30 05:28:16 +03:00
if ( unlock )
mmap_read_unlock ( mm ) ;
2005-04-17 02:20:36 +04:00
2023-06-30 05:28:16 +03:00
return 0 ;
2022-09-06 22:48:45 +03:00
2023-06-18 03:47:08 +03:00
clear_tree_failed :
2022-09-06 22:48:45 +03:00
userfaultfd_error :
2023-06-18 03:47:08 +03:00
munmap_gather_failed :
2022-09-06 22:49:06 +03:00
end_split_failed :
2023-06-18 03:47:08 +03:00
mas_set ( & mas_detach , 0 ) ;
mas_for_each ( & mas_detach , next , end )
vma_mark_detached ( next , false ) ;
2022-09-06 22:49:06 +03:00
__mt_destroy ( & mt_detach ) ;
start_split_failed :
map_count_exceeded :
2023-07-04 05:29:48 +03:00
validate_mm ( mm ) ;
2022-09-06 22:48:45 +03:00
return error ;
2005-04-17 02:20:36 +04:00
}
2022-09-06 22:48:52 +03:00
/*
2023-01-20 19:26:13 +03:00
* do_vmi_munmap ( ) - munmap a given range .
* @ vmi : The vma iterator
2022-09-06 22:48:52 +03:00
* @ mm : The mm_struct
* @ start : The start address to munmap
* @ len : The length of the range to munmap
* @ uf : The userfaultfd list_head
2023-06-30 05:28:16 +03:00
* @ unlock : set to true if the user wants to drop the mmap_lock on success
2022-09-06 22:48:52 +03:00
*
* This function takes a @ mas that is either pointing to the previous VMA or set
* to MA_START and sets it up to remove the mapping ( s ) . The @ len will be
* aligned and any arch_unmap work will be preformed .
*
2023-06-30 05:28:16 +03:00
* Return : 0 on success and drops the lock if so directed , error and leaves the
* lock held otherwise .
2022-09-06 22:48:52 +03:00
*/
2023-01-20 19:26:13 +03:00
int do_vmi_munmap ( struct vma_iterator * vmi , struct mm_struct * mm ,
2022-09-06 22:48:52 +03:00
unsigned long start , size_t len , struct list_head * uf ,
2023-06-30 05:28:16 +03:00
bool unlock )
2022-09-06 22:48:52 +03:00
{
unsigned long end ;
struct vm_area_struct * vma ;
if ( ( offset_in_page ( start ) ) | | start > TASK_SIZE | | len > TASK_SIZE - start )
return - EINVAL ;
end = start + PAGE_ALIGN ( len ) ;
if ( end = = start )
return - EINVAL ;
/* arch_unmap() might do unmaps itself. */
arch_unmap ( mm , start , end ) ;
/* Find the first overlapping VMA */
2023-01-20 19:26:13 +03:00
vma = vma_find ( vmi , end ) ;
2023-06-30 05:28:16 +03:00
if ( ! vma ) {
if ( unlock )
mmap_write_unlock ( mm ) ;
2022-09-06 22:48:52 +03:00
return 0 ;
2023-06-30 05:28:16 +03:00
}
2022-09-06 22:48:52 +03:00
2023-06-30 05:28:16 +03:00
return do_vmi_align_munmap ( vmi , vma , mm , start , end , uf , unlock ) ;
2022-09-06 22:48:52 +03:00
}
/* do_munmap() - Wrapper function for non-maple tree aware do_munmap() calls.
* @ mm : The mm_struct
* @ start : The start address to munmap
* @ len : The length to be munmapped .
* @ uf : The userfaultfd list_head
2023-06-30 05:28:16 +03:00
*
* Return : 0 on success , error otherwise .
2022-09-06 22:48:52 +03:00
*/
mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.
Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.
History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great
feedback and suggestions.
Then this topic was discussed on LSFMM summit 2018. In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach. Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)
Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
* The unexpected state from PF if it wins the race in the middle of munmap.
It may return zero page, instead of the content or SIGSEGV.
* Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
is a showstopper from akpm
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer. So, uprobe unmap will not be handled by the regular
path.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
This patch (of 3):
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle. When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).
Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to
cleanup vmas.
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.
munmap_test-15002 [008] 594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us
| unmap_region();
munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us
| }
Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.
[1] https://lwn.net/Articles/753269/
Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:07:11 +03:00
int do_munmap ( struct mm_struct * mm , unsigned long start , size_t len ,
struct list_head * uf )
{
2023-01-20 19:26:13 +03:00
VMA_ITERATOR ( vmi , mm , start ) ;
2022-09-06 22:48:52 +03:00
2023-01-20 19:26:13 +03:00
return do_vmi_munmap ( & vmi , mm , start , len , uf , false ) ;
mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.
Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.
History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great
feedback and suggestions.
Then this topic was discussed on LSFMM summit 2018. In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach. Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)
Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
* The unexpected state from PF if it wins the race in the middle of munmap.
It may return zero page, instead of the content or SIGSEGV.
* Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
is a showstopper from akpm
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer. So, uprobe unmap will not be handled by the regular
path.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
This patch (of 3):
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle. When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).
Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to
cleanup vmas.
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.
munmap_test-15002 [008] 594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us
| unmap_region();
munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us
| }
Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.
[1] https://lwn.net/Articles/753269/
Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:07:11 +03:00
}
2022-09-06 22:48:52 +03:00
unsigned long mmap_region ( struct file * file , unsigned long addr ,
unsigned long len , vm_flags_t vm_flags , unsigned long pgoff ,
struct list_head * uf )
{
struct mm_struct * mm = current - > mm ;
struct vm_area_struct * vma = NULL ;
struct vm_area_struct * next , * prev , * merge ;
pgoff_t pglen = len > > PAGE_SHIFT ;
unsigned long charged = 0 ;
unsigned long end = addr + len ;
unsigned long merge_start = addr , merge_end = end ;
pgoff_t vm_pgoff ;
int error ;
2023-01-20 19:26:13 +03:00
VMA_ITERATOR ( vmi , mm , addr ) ;
2022-09-06 22:48:52 +03:00
/* Check against address space limit. */
if ( ! may_expand_vm ( mm , vm_flags , len > > PAGE_SHIFT ) ) {
unsigned long nr_pages ;
/*
* MAP_FIXED may remove pages of mappings that intersects with
* requested mapping . Account for the pages it would unmap .
*/
nr_pages = count_vma_pages_range ( mm , addr , end ) ;
if ( ! may_expand_vm ( mm , vm_flags ,
( len > > PAGE_SHIFT ) - nr_pages ) )
return - ENOMEM ;
}
/* Unmap any existing mapping in the area */
2023-01-20 19:26:13 +03:00
if ( do_vmi_munmap ( & vmi , mm , addr , len , uf , false ) )
2022-09-06 22:48:52 +03:00
return - ENOMEM ;
/*
* Private writable mapping : check memory availability
*/
if ( accountable_mapping ( file , vm_flags ) ) {
charged = len > > PAGE_SHIFT ;
if ( security_vm_enough_memory_mm ( mm , charged ) )
return - ENOMEM ;
vm_flags | = VM_ACCOUNT ;
}
2023-01-20 19:26:13 +03:00
next = vma_next ( & vmi ) ;
prev = vma_prev ( & vmi ) ;
2022-09-06 22:48:52 +03:00
if ( vm_flags & VM_SPECIAL )
goto cannot_expand ;
/* Attempt to expand an old mapping */
/* Check next */
if ( next & & next - > vm_start = = end & & ! vma_policy ( next ) & &
can_vma_merge_before ( next , vm_flags , NULL , file , pgoff + pglen ,
NULL_VM_UFFD_CTX , NULL ) ) {
merge_end = next - > vm_end ;
vma = next ;
vm_pgoff = next - > vm_pgoff - pglen ;
}
/* Check prev */
if ( prev & & prev - > vm_end = = addr & & ! vma_policy ( prev ) & &
( vma ? can_vma_merge_after ( prev , vm_flags , vma - > anon_vma , file ,
pgoff , vma - > vm_userfaultfd_ctx , NULL ) :
can_vma_merge_after ( prev , vm_flags , NULL , file , pgoff ,
NULL_VM_UFFD_CTX , NULL ) ) ) {
merge_start = prev - > vm_start ;
vma = prev ;
vm_pgoff = prev - > vm_pgoff ;
}
/* Actually expand, if possible */
if ( vma & &
2023-01-20 19:26:14 +03:00
! vma_expand ( & vmi , vma , merge_start , merge_end , vm_pgoff , next ) ) {
2022-09-06 22:48:52 +03:00
khugepaged_enter_vma ( vma , vm_flags ) ;
goto expanded ;
}
cannot_expand :
2023-05-18 17:55:44 +03:00
if ( prev )
vma_iter_next_range ( & vmi ) ;
2022-09-06 22:48:52 +03:00
/*
* Determine the object being mapped and call the appropriate
* specific mapper . the address has already been validated , but
* not unmapped , but the maps are removed from the list .
*/
vma = vm_area_alloc ( mm ) ;
if ( ! vma ) {
error = - ENOMEM ;
goto unacct_error ;
}
2023-01-20 19:26:36 +03:00
vma_iter_set ( & vmi , addr ) ;
2022-09-06 22:48:52 +03:00
vma - > vm_start = addr ;
vma - > vm_end = end ;
2023-01-26 22:37:49 +03:00
vm_flags_init ( vma , vm_flags ) ;
2022-09-06 22:48:52 +03:00
vma - > vm_page_prot = vm_get_page_prot ( vm_flags ) ;
vma - > vm_pgoff = pgoff ;
if ( file ) {
if ( vm_flags & VM_SHARED ) {
error = mapping_map_writable ( file - > f_mapping ) ;
if ( error )
goto free_vma ;
}
vma - > vm_file = get_file ( file ) ;
error = call_mmap ( file , vma ) ;
if ( error )
goto unmap_and_free_vma ;
2022-10-18 22:17:12 +03:00
/*
* Expansion is handled above , merging is handled below .
* Drivers should not alter the address of the VMA .
2022-09-06 22:48:52 +03:00
*/
2023-01-20 19:26:38 +03:00
error = - EINVAL ;
if ( WARN_ON ( ( addr ! = vma - > vm_start ) ) )
2022-10-18 22:17:12 +03:00
goto close_and_free_vma ;
2022-09-06 22:48:52 +03:00
2023-01-20 19:26:38 +03:00
vma_iter_set ( & vmi , addr ) ;
2022-09-06 22:48:52 +03:00
/*
* If vm_flags changed after call_mmap ( ) , we should try merge
* vma again as we may succeed this time .
*/
if ( unlikely ( vm_flags ! = vma - > vm_flags & & prev ) ) {
2023-01-20 19:26:30 +03:00
merge = vma_merge ( & vmi , mm , prev , vma - > vm_start ,
vma - > vm_end , vma - > vm_flags , NULL ,
vma - > vm_file , vma - > vm_pgoff , NULL ,
NULL_VM_UFFD_CTX , NULL ) ;
2022-09-06 22:48:52 +03:00
if ( merge ) {
/*
* - > mmap ( ) can change vma - > vm_file and fput
* the original file . So fput the vma - > vm_file
* here or we would add an extra fput for file
* and cause general protection fault
* ultimately .
*/
fput ( vma - > vm_file ) ;
vm_area_free ( vma ) ;
vma = merge ;
/* Update vm_flags to pick up the change. */
vm_flags = vma - > vm_flags ;
goto unmap_writable ;
}
}
vm_flags = vma - > vm_flags ;
} else if ( vm_flags & VM_SHARED ) {
error = shmem_zero_setup ( vma ) ;
if ( error )
goto free_vma ;
} else {
vma_set_anonymous ( vma ) ;
}
mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)",
v2.
The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but subsequently
changed to read-only. Therefore the filter simply rejects any
mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support
(Branch Target Identification), the dynamic loader cannot change an ELF
section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For
libraries, it can resort to unmapping and re-mapping but for the main
executable it does not have a file descriptor. The original bug report in
the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries
- [4].
This series adds in-kernel support for this feature as a prctl
PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE |
PROT_EXEC mappings. Like the systemd BPF filter it also denies adding
PROT_EXEC to mappings. However unlike the BPF filter it only denies it if
the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC ->
PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF
filter.
This patch (of 2):
The aim of such policy is to prevent a user task from creating an
executable mapping that is also writeable.
An example of mmap() returning -EACCESS if the policy is enabled:
mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
Similarly, mprotect() would return -EACCESS below:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
The BPF filter that systemd MDWE uses is stateless, and disallows
mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
be enabled if it was already PROT_EXEC, which allows the following case:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
where PROT_BTI enables branch tracking identification on arm64.
Link: https://lkml.kernel.org/r/20230119160344.54358-1-joey.gouly@arm.com
Link: https://lkml.kernel.org/r/20230119160344.54358-2-joey.gouly@arm.com
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: nd <nd@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Topi Miettinen <toiwoton@gmail.com>
Cc: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19 19:03:43 +03:00
if ( map_deny_write_exec ( vma , vma - > vm_flags ) ) {
error = - EACCES ;
2023-03-08 22:04:20 +03:00
goto close_and_free_vma ;
mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)",
v2.
The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but subsequently
changed to read-only. Therefore the filter simply rejects any
mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support
(Branch Target Identification), the dynamic loader cannot change an ELF
section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For
libraries, it can resort to unmapping and re-mapping but for the main
executable it does not have a file descriptor. The original bug report in
the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries
- [4].
This series adds in-kernel support for this feature as a prctl
PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE |
PROT_EXEC mappings. Like the systemd BPF filter it also denies adding
PROT_EXEC to mappings. However unlike the BPF filter it only denies it if
the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC ->
PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF
filter.
This patch (of 2):
The aim of such policy is to prevent a user task from creating an
executable mapping that is also writeable.
An example of mmap() returning -EACCESS if the policy is enabled:
mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
Similarly, mprotect() would return -EACCESS below:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
The BPF filter that systemd MDWE uses is stateless, and disallows
mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
be enabled if it was already PROT_EXEC, which allows the following case:
addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
where PROT_BTI enables branch tracking identification on arm64.
Link: https://lkml.kernel.org/r/20230119160344.54358-1-joey.gouly@arm.com
Link: https://lkml.kernel.org/r/20230119160344.54358-2-joey.gouly@arm.com
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: nd <nd@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Topi Miettinen <toiwoton@gmail.com>
Cc: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-19 19:03:43 +03:00
}
2022-09-06 22:48:52 +03:00
/* Allow architectures to sanity-check the vm_flags */
2023-01-20 19:26:38 +03:00
error = - EINVAL ;
if ( ! arch_validate_flags ( vma - > vm_flags ) )
goto close_and_free_vma ;
2022-09-06 22:48:52 +03:00
2023-01-20 19:26:38 +03:00
error = - ENOMEM ;
if ( vma_iter_prealloc ( & vmi ) )
goto close_and_free_vma ;
2022-09-06 22:48:52 +03:00
2023-07-09 02:04:00 +03:00
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write ( vma ) ;
2023-01-20 19:26:13 +03:00
vma_iter_store ( & vmi , vma ) ;
2022-09-06 22:48:52 +03:00
mm - > map_count + + ;
if ( vma - > vm_file ) {
mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock
UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. When running multiple parallel
tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
come from load_elf_binary through the call chain
"execl->do_execveat_common->bprm_execve->load_elf_binary".
In do_mmap,it will call mmap_region to create vma node, initialize it and
insert it to vma maintain structure in mm_struct and i_mmap tree of the
mapping file, then increase map_count to record the number of vma nodes
used. The hot osq_lock is to protect operations on file's i_mmap tree.
For the mm_struct member change like vma insertion and map_count update,
they do not affect i_mmap tree. Move those operations out of the lock's
critical section, to reduce hold time on the lock.
With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
obvious performance gain on v6.5-rc1 due to regression of this benchmark
from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
mm's rss stats into percpu_counter). Related discussion and conclusion
can be referred at the mail thread initiated by 0day as below: Link:
https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-7e8fb108def0@intel.com/
Link: https://lkml.kernel.org/r/20230712145739.604215-1-yu.ma@intel.com
Signed-off-by: Yu Ma <yu.ma@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhu, Lipeng <lipeng.zhu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 17:57:39 +03:00
i_mmap_lock_write ( vma - > vm_file - > f_mapping ) ;
2022-09-06 22:48:52 +03:00
if ( vma - > vm_flags & VM_SHARED )
mapping_allow_writable ( vma - > vm_file - > f_mapping ) ;
flush_dcache_mmap_lock ( vma - > vm_file - > f_mapping ) ;
vma_interval_tree_insert ( vma , & vma - > vm_file - > f_mapping - > i_mmap ) ;
flush_dcache_mmap_unlock ( vma - > vm_file - > f_mapping ) ;
i_mmap_unlock_write ( vma - > vm_file - > f_mapping ) ;
}
/*
* vma_merge ( ) calls khugepaged_enter_vma ( ) either , the below
* call covers the non - merge case .
*/
khugepaged_enter_vma ( vma , vma - > vm_flags ) ;
/* Once vma denies write, undo our temporary denial count */
unmap_writable :
if ( file & & vm_flags & VM_SHARED )
mapping_unmap_writable ( file - > f_mapping ) ;
file = vma - > vm_file ;
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example for this are programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these type of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higler level entity
like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal workload knowledge to make targeted madvise
calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have the knowledge about
what else is running on a machine. Therefore it has to be very
conservative in what memory areas can be shared or not. However, if the
system is dedicated to running multiple jobs within the same security
domain, its the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine
with 64GB main memory):
full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests.
This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test
the new prctl options and the other one is a fork test to verify that
the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (ksm) gets enabled for all vma's of a
process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been
enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call
- enable ksm for all vmas of a process (if the vmas support it).
- query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the
MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 08:13:40 +03:00
ksm_add_vma ( vma ) ;
2022-09-06 22:48:52 +03:00
expanded :
perf_event_mmap ( vma ) ;
vm_stat_account ( mm , vm_flags , len > > PAGE_SHIFT ) ;
if ( vm_flags & VM_LOCKED ) {
if ( ( vm_flags & VM_SPECIAL ) | | vma_is_dax ( vma ) | |
is_vm_hugetlb_page ( vma ) | |
vma = = get_gate_vma ( current - > mm ) )
2023-01-26 22:37:48 +03:00
vm_flags_clear ( vma , VM_LOCKED_MASK ) ;
2022-09-06 22:48:52 +03:00
else
mm - > locked_vm + = ( len > > PAGE_SHIFT ) ;
}
if ( file )
uprobe_mmap ( vma ) ;
/*
* New ( or expanded ) vma always get soft dirty status .
* Otherwise user - space soft - dirty page tracker won ' t
* be able to distinguish situation when vma area unmapped ,
* then new mapped in - place ( which must be aimed as
* a completely new data area ) .
*/
2023-01-26 22:37:49 +03:00
vm_flags_set ( vma , VM_SOFTDIRTY ) ;
2022-09-06 22:48:52 +03:00
vma_set_page_prot ( vma ) ;
validate_mm ( mm ) ;
return addr ;
mm/mmap: undo ->mmap() when arch_validate_flags() fails
Commit c462ac288f2c ("mm: Introduce arch_validate_flags()") added a late
check in mmap_region() to let architectures validate vm_flags. The check
needs to happen after calling ->mmap() as the flags can potentially be
modified during this callback.
If arch_validate_flags() check fails we unmap and free the vma. However,
the error path fails to undo the ->mmap() call that previously succeeded
and depending on the specific ->mmap() implementation this translates to
reference increments, memory allocations and other operations what will
not be cleaned up.
There are several places (mainly device drivers) where this is an issue.
However, one specific example is bpf_map_mmap() which keeps count of the
mappings in map->writecnt. The count is incremented on ->mmap() and then
decremented on vm_ops->close(). When arch_validate_flags() fails this
count is off since bpf_map_mmap_close() is never called.
One can reproduce this issue in arm64 devices with MTE support. Here the
vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set
previously. From userspace then is enough to pass the PROT_MTE flag to
mmap() syscall to trigger the arch_validate_flags() failure.
The following program reproduces this issue:
#include <stdio.h>
#include <unistd.h>
#include <linux/unistd.h>
#include <linux/bpf.h>
#include <sys/mman.h>
int main(void)
{
union bpf_attr attr = {
.map_type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(long long),
.max_entries = 256,
.map_flags = BPF_F_MMAPABLE,
};
int fd;
fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);
return 0;
}
By manually adding some log statements to the vm_ops callbacks we can
confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
->release():
With PROT_MTE flag:
root@debian:~# ./bpf-test
[ 111.263874] bpf_map_write_active_inc: map=9 writecnt=1
[ 111.288763] bpf_map_release: map=9 writecnt=1
Without PROT_MTE flag:
root@debian:~# ./bpf-test
[ 157.816912] bpf_map_write_active_inc: map=10 writecnt=1
[ 157.830442] bpf_map_write_active_dec: map=10 writecnt=0
[ 157.832396] bpf_map_release: map=10 writecnt=0
This patch fixes the above issue by calling vm_ops->close() when the
arch_validate_flags() check fails, after this we can proceed to unmap and
free the vma on the error path.
Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
Fixes: c462ac288f2c ("mm: Introduce arch_validate_flags()")
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-30 03:38:43 +03:00
close_and_free_vma :
2023-01-20 19:26:38 +03:00
if ( file & & vma - > vm_ops & & vma - > vm_ops - > close )
mm/mmap: undo ->mmap() when arch_validate_flags() fails
Commit c462ac288f2c ("mm: Introduce arch_validate_flags()") added a late
check in mmap_region() to let architectures validate vm_flags. The check
needs to happen after calling ->mmap() as the flags can potentially be
modified during this callback.
If arch_validate_flags() check fails we unmap and free the vma. However,
the error path fails to undo the ->mmap() call that previously succeeded
and depending on the specific ->mmap() implementation this translates to
reference increments, memory allocations and other operations what will
not be cleaned up.
There are several places (mainly device drivers) where this is an issue.
However, one specific example is bpf_map_mmap() which keeps count of the
mappings in map->writecnt. The count is incremented on ->mmap() and then
decremented on vm_ops->close(). When arch_validate_flags() fails this
count is off since bpf_map_mmap_close() is never called.
One can reproduce this issue in arm64 devices with MTE support. Here the
vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set
previously. From userspace then is enough to pass the PROT_MTE flag to
mmap() syscall to trigger the arch_validate_flags() failure.
The following program reproduces this issue:
#include <stdio.h>
#include <unistd.h>
#include <linux/unistd.h>
#include <linux/bpf.h>
#include <sys/mman.h>
int main(void)
{
union bpf_attr attr = {
.map_type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(long long),
.max_entries = 256,
.map_flags = BPF_F_MMAPABLE,
};
int fd;
fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);
return 0;
}
By manually adding some log statements to the vm_ops callbacks we can
confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
->release():
With PROT_MTE flag:
root@debian:~# ./bpf-test
[ 111.263874] bpf_map_write_active_inc: map=9 writecnt=1
[ 111.288763] bpf_map_release: map=9 writecnt=1
Without PROT_MTE flag:
root@debian:~# ./bpf-test
[ 157.816912] bpf_map_write_active_inc: map=10 writecnt=1
[ 157.830442] bpf_map_write_active_dec: map=10 writecnt=0
[ 157.832396] bpf_map_release: map=10 writecnt=0
This patch fixes the above issue by calling vm_ops->close() when the
arch_validate_flags() check fails, after this we can proceed to unmap and
free the vma on the error path.
Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
Fixes: c462ac288f2c ("mm: Introduce arch_validate_flags()")
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-30 03:38:43 +03:00
vma - > vm_ops - > close ( vma ) ;
2023-01-20 19:26:38 +03:00
if ( file | | vma - > vm_file ) {
2022-09-06 22:48:52 +03:00
unmap_and_free_vma :
2023-01-20 19:26:38 +03:00
fput ( vma - > vm_file ) ;
vma - > vm_file = NULL ;
2022-09-06 22:48:52 +03:00
2023-01-20 19:26:38 +03:00
/* Undo any partial mapping done by a device driver. */
unmap_region ( mm , & mm - > mm_mt , vma , prev , next , vma - > vm_start ,
2023-01-26 22:37:51 +03:00
vma - > vm_end , true ) ;
2023-01-20 19:26:38 +03:00
}
2022-10-28 10:37:17 +03:00
if ( file & & ( vm_flags & VM_SHARED ) )
2022-09-06 22:48:52 +03:00
mapping_unmap_writable ( file - > f_mapping ) ;
free_vma :
vm_area_free ( vma ) ;
unacct_error :
if ( charged )
vm_unacct_memory ( charged ) ;
validate_mm ( mm ) ;
return error ;
}
2023-06-30 05:28:16 +03:00
static int __vm_munmap ( unsigned long start , size_t len , bool unlock )
2005-04-17 02:20:36 +04:00
{
int ret ;
2012-04-21 05:57:04 +04:00
struct mm_struct * mm = current - > mm ;
2017-02-25 01:58:22 +03:00
LIST_HEAD ( uf ) ;
2023-01-20 19:26:13 +03:00
VMA_ITERATOR ( vmi , mm , start ) ;
2005-04-17 02:20:36 +04:00
2020-06-09 07:33:25 +03:00
if ( mmap_write_lock_killable ( mm ) )
2016-05-24 02:25:33 +03:00
return - EINTR ;
2023-06-30 05:28:16 +03:00
ret = do_vmi_munmap ( & vmi , mm , start , len , & uf , unlock ) ;
if ( ret | | ! unlock )
2020-06-09 07:33:25 +03:00
mmap_write_unlock ( mm ) ;
mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.
Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.
History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great
feedback and suggestions.
Then this topic was discussed on LSFMM summit 2018. In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach. Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)
Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
* The unexpected state from PF if it wins the race in the middle of munmap.
It may return zero page, instead of the content or SIGSEGV.
* Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
is a showstopper from akpm
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer. So, uprobe unmap will not be handled by the regular
path.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
This patch (of 3):
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle. When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).
Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to
cleanup vmas.
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.
munmap_test-15002 [008] 594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us
| unmap_region();
munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us
| }
Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.
[1] https://lwn.net/Articles/753269/
Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:07:11 +03:00
2017-02-25 01:58:22 +03:00
userfaultfd_unmap_complete ( mm , & uf ) ;
2005-04-17 02:20:36 +04:00
return ret ;
}
mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.
Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.
History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great
feedback and suggestions.
Then this topic was discussed on LSFMM summit 2018. In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach. Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)
Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
* The unexpected state from PF if it wins the race in the middle of munmap.
It may return zero page, instead of the content or SIGSEGV.
* Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
is a showstopper from akpm
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer. So, uprobe unmap will not be handled by the regular
path.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
This patch (of 3):
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle. When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).
Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to
cleanup vmas.
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.
munmap_test-15002 [008] 594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us
| unmap_region();
munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us
| }
Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.
[1] https://lwn.net/Articles/753269/
Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:07:11 +03:00
int vm_munmap ( unsigned long start , size_t len )
{
return __vm_munmap ( start , len , false ) ;
}
2012-04-21 03:20:01 +04:00
EXPORT_SYMBOL ( vm_munmap ) ;
SYSCALL_DEFINE2 ( munmap , unsigned long , addr , size_t , len )
{
2019-09-26 02:49:04 +03:00
addr = untagged_addr ( addr ) ;
mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.
Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.
History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great
feedback and suggestions.
Then this topic was discussed on LSFMM summit 2018. In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach. Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)
Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
* The unexpected state from PF if it wins the race in the middle of munmap.
It may return zero page, instead of the content or SIGSEGV.
* Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
is a showstopper from akpm
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer. So, uprobe unmap will not be handled by the regular
path.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
This patch (of 3):
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle. When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).
Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to
cleanup vmas.
But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
acquire write mmap_sem
lookup vmas (find and split vmas)
deal with special mappings
detach vmas
downgrade_write
zap pages
free page tables
release mmap_sem
The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e. page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.
If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings. They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.
But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.
Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.
With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.
And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.
For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle. But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization. So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them. mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.
With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.
munmap_test-15002 [008] 594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us
| unmap_region();
munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us
| }
Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.
[1] https://lwn.net/Articles/753269/
Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:07:11 +03:00
return __vm_munmap ( addr , len , true ) ;
2012-04-21 03:20:01 +04:00
}
2005-04-17 02:20:36 +04:00
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
/*
* Emulation of deprecated remap_file_pages ( ) syscall .
*/
SYSCALL_DEFINE5 ( remap_file_pages , unsigned long , start , unsigned long , size ,
unsigned long , prot , unsigned long , pgoff , unsigned long , flags )
{
struct mm_struct * mm = current - > mm ;
struct vm_area_struct * vma ;
unsigned long populate = 0 ;
unsigned long ret = - EINVAL ;
struct file * file ;
2022-06-27 09:00:26 +03:00
pr_warn_once ( " %s (%d) uses deprecated remap_file_pages() syscall. See Documentation/mm/remap_file_pages.rst. \n " ,
2016-03-18 00:19:47 +03:00
current - > comm , current - > pid ) ;
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
if ( prot )
return ret ;
start = start & PAGE_MASK ;
size = size & PAGE_MASK ;
if ( start + size < = start )
return ret ;
/* Does pgoff wrap? */
if ( pgoff + ( size > > PAGE_SHIFT ) < pgoff )
return ret ;
2020-06-09 07:33:25 +03:00
if ( mmap_write_lock_killable ( mm ) )
2016-05-24 02:25:27 +03:00
return - EINTR ;
2021-09-03 00:56:49 +03:00
vma = vma_lookup ( mm , start ) ;
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
if ( ! vma | | ! ( vma - > vm_flags & VM_SHARED ) )
goto out ;
mm: fix regression in remap_file_pages() emulation
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 00:11:15 +03:00
if ( start + size > vma - > vm_end ) {
2022-09-06 22:49:06 +03:00
VMA_ITERATOR ( vmi , mm , vma - > vm_end ) ;
struct vm_area_struct * next , * prev = vma ;
mm: fix regression in remap_file_pages() emulation
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 00:11:15 +03:00
2022-09-06 22:49:06 +03:00
for_each_vma_range ( vmi , next , start + size ) {
mm: fix regression in remap_file_pages() emulation
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 00:11:15 +03:00
/* hole between vmas ? */
2022-09-06 22:49:06 +03:00
if ( next - > vm_start ! = prev - > vm_end )
mm: fix regression in remap_file_pages() emulation
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 00:11:15 +03:00
goto out ;
if ( next - > vm_file ! = vma - > vm_file )
goto out ;
if ( next - > vm_flags ! = vma - > vm_flags )
goto out ;
2022-10-25 19:12:49 +03:00
if ( start + size < = next - > vm_end )
break ;
2022-09-06 22:49:06 +03:00
prev = next ;
mm: fix regression in remap_file_pages() emulation
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 00:11:15 +03:00
}
if ( ! next )
goto out ;
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
}
prot | = vma - > vm_flags & VM_READ ? PROT_READ : 0 ;
prot | = vma - > vm_flags & VM_WRITE ? PROT_WRITE : 0 ;
prot | = vma - > vm_flags & VM_EXEC ? PROT_EXEC : 0 ;
flags & = MAP_NONBLOCK ;
flags | = MAP_SHARED | MAP_FIXED | MAP_POPULATE ;
2021-05-05 04:38:06 +03:00
if ( vma - > vm_flags & VM_LOCKED )
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
flags | = MAP_LOCKED ;
mm: fix regression in remap_file_pages() emulation
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 00:11:15 +03:00
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
file = get_file ( vma - > vm_file ) ;
2020-08-07 09:23:37 +03:00
ret = do_mmap ( vma - > vm_file , start , size ,
2017-02-25 01:58:22 +03:00
prot , flags , pgoff , & populate , NULL ) ;
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
fput ( file ) ;
out :
2020-06-09 07:33:25 +03:00
mmap_write_unlock ( mm ) ;
mm: replace remap_file_pages() syscall with emulation
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.
The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 01:09:46 +03:00
if ( populate )
mm_populate ( ret , populate ) ;
if ( ! IS_ERR_VALUE ( ret ) )
ret = 0 ;
return ret ;
}
2005-04-17 02:20:36 +04:00
/*
2023-01-27 00:20:49 +03:00
* do_vma_munmap ( ) - Unmap a full or partial vma .
* @ vmi : The vma iterator pointing at the vma
* @ vma : The first vma to be munmapped
* @ start : the start of the address to unmap
* @ end : The end of the address to unmap
2022-09-06 22:48:50 +03:00
* @ uf : The userfaultfd list_head
2023-06-30 05:28:16 +03:00
* @ unlock : Drop the lock on success
2022-09-06 22:48:50 +03:00
*
2023-01-27 00:20:49 +03:00
* unmaps a VMA mapping when the vma iterator is already in position .
* Does not handle alignment .
2023-06-30 05:28:16 +03:00
*
* Return : 0 on success drops the lock of so directed , error on failure and will
* still hold the lock .
2005-04-17 02:20:36 +04:00
*/
2023-01-27 00:20:49 +03:00
int do_vma_munmap ( struct vma_iterator * vmi , struct vm_area_struct * vma ,
2023-06-30 05:28:16 +03:00
unsigned long start , unsigned long end , struct list_head * uf ,
bool unlock )
2005-04-17 02:20:36 +04:00
{
2022-09-06 22:48:50 +03:00
struct mm_struct * mm = vma - > vm_mm ;
2006-09-07 14:17:04 +04:00
2023-01-27 00:20:49 +03:00
arch_unmap ( mm , start , end ) ;
2023-07-04 05:29:48 +03:00
return do_vmi_align_munmap ( vmi , vma , mm , start , end , uf , unlock ) ;
2022-09-06 22:48:50 +03:00
}
2005-04-17 02:20:36 +04:00
2022-09-06 22:48:50 +03:00
/*
* do_brk_flags ( ) - Increase the brk vma if the flags match .
2023-01-20 19:26:09 +03:00
* @ vmi : The vma iterator
2022-09-06 22:48:50 +03:00
* @ addr : The start address
* @ len : The length of the increase
* @ vma : The vma ,
* @ flags : The VMA Flags
*
* Extend the brk VMA from addr to addr + len . If the VMA is NULL or the flags
* do not match then create a new anonymous VMA . Eventually we may be able to
* do some brk - specific accounting here .
*/
2023-01-20 19:26:09 +03:00
static int do_brk_flags ( struct vma_iterator * vmi , struct vm_area_struct * vma ,
2022-09-06 22:49:06 +03:00
unsigned long addr , unsigned long len , unsigned long flags )
2022-09-06 22:48:50 +03:00
{
struct mm_struct * mm = current - > mm ;
2023-01-20 19:26:48 +03:00
struct vma_prepare vp ;
2005-04-17 02:20:36 +04:00
2022-09-06 22:48:50 +03:00
/*
* Check against address space limits by the changed size
* Note : This happens * after * clearing old mappings in some code paths .
*/
flags | = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm - > def_flags ;
2016-01-15 02:22:07 +03:00
if ( ! may_expand_vm ( mm , flags , len > > PAGE_SHIFT ) )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
if ( mm - > map_count > sysctl_max_map_count )
return - ENOMEM ;
2012-02-13 07:58:52 +04:00
if ( security_vm_enough_memory_mm ( mm , len > > PAGE_SHIFT ) )
2005-04-17 02:20:36 +04:00
return - ENOMEM ;
/*
2022-09-06 22:48:50 +03:00
* Expand the existing vma if possible ; Note that singular lists do not
* occur after forking , so the expand will only happen on new VMAs .
2005-04-17 02:20:36 +04:00
*/
2022-12-05 22:23:17 +03:00
if ( vma & & vma - > vm_end = = addr & & ! vma_policy ( vma ) & &
can_vma_merge_after ( vma , flags , NULL , NULL ,
addr > > PAGE_SHIFT , NULL_VM_UFFD_CTX , NULL ) ) {
2023-01-20 19:26:09 +03:00
if ( vma_iter_prealloc ( vmi ) )
2022-12-02 07:53:39 +03:00
goto unacct_fail ;
2022-10-11 19:08:37 +03:00
2023-01-20 19:26:48 +03:00
init_vma_prep ( & vp , vma ) ;
vma_prepare ( & vp ) ;
2023-02-27 20:36:13 +03:00
vma_adjust_trans_huge ( vma , vma - > vm_start , addr + len , 0 ) ;
2022-09-06 22:48:50 +03:00
vma - > vm_end = addr + len ;
2023-01-26 22:37:49 +03:00
vm_flags_set ( vma , VM_SOFTDIRTY ) ;
2023-01-20 19:26:09 +03:00
vma_iter_store ( vmi , vma ) ;
2022-09-06 22:48:50 +03:00
2023-01-20 19:26:48 +03:00
vma_complete ( & vp , vmi , mm ) ;
2022-09-06 22:48:50 +03:00
khugepaged_enter_vma ( vma , flags ) ;
goto out ;
2005-04-17 02:20:36 +04:00
}
2022-09-06 22:48:50 +03:00
/* create a vma struct for an anonymous mapping */
vma = vm_area_alloc ( mm ) ;
if ( ! vma )
2022-12-02 07:53:39 +03:00
goto unacct_fail ;
2005-04-17 02:20:36 +04:00
mm: fix vma_is_anonymous() false-positives
vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
VMA. This is unreliable as ->mmap may not set ->vm_ops.
False-positive vma_is_anonymous() may lead to crashes:
next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
------------[ cut here ]------------
kernel BUG at mm/memory.c:1422!
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
Call Trace:
unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
unmap_mapping_range_vma mm/memory.c:2792 [inline]
unmap_mapping_range_tree mm/memory.c:2813 [inline]
unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
unmap_mapping_range+0x48/0x60 mm/memory.c:2880
truncate_pagecache+0x54/0x90 mm/truncate.c:800
truncate_setsize+0x70/0xb0 mm/truncate.c:826
simple_setattr+0xe9/0x110 fs/libfs.c:409
notify_change+0xf13/0x10f0 fs/attr.c:335
do_truncate+0x1ac/0x2b0 fs/open.c:63
do_sys_ftruncate+0x492/0x560 fs/open.c:205
__do_sys_ftruncate fs/open.c:215 [inline]
__se_sys_ftruncate fs/open.c:213 [inline]
__x64_sys_ftruncate+0x59/0x80 fs/open.c:213
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reproducer:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE _IO('c', 100)
#define KCOV_DISABLE _IO('c', 101)
#define COVER_SIZE (1024<<10)
#define KCOV_TRACE_PC 0
#define KCOV_TRACE_CMP 1
int main(int argc, char **argv)
{
int fd;
unsigned long *cover;
system("mount -t debugfs none /sys/kernel/debug");
fd = open("/sys/kernel/debug/kcov", O_RDWR);
ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
munmap(cover, COVER_SIZE * sizeof(unsigned long));
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
ftruncate(fd, 3UL << 20);
return 0;
}
This can be fixed by assigning anonymous VMAs own vm_ops and not relying
on it being NULL.
If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-27 02:37:35 +03:00
vma_set_anonymous ( vma ) ;
2005-04-17 02:20:36 +04:00
vma - > vm_start = addr ;
vma - > vm_end = addr + len ;
2022-09-06 22:48:50 +03:00
vma - > vm_pgoff = addr > > PAGE_SHIFT ;
2023-01-26 22:37:49 +03:00
vm_flags_init ( vma , flags ) ;
2007-10-19 10:39:15 +04:00
vma - > vm_page_prot = vm_get_page_prot ( flags ) ;
2023-01-20 19:26:09 +03:00
if ( vma_iter_store_gfp ( vmi , vma , GFP_KERNEL ) )
2022-09-06 22:48:50 +03:00
goto mas_store_fail ;
2022-09-06 22:48:45 +03:00
2022-09-06 22:48:50 +03:00
mm - > map_count + + ;
2023-07-14 22:55:48 +03:00
validate_mm ( mm ) ;
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example for this are programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these type of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higler level entity
like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal workload knowledge to make targeted madvise
calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have the knowledge about
what else is running on a machine. Therefore it has to be very
conservative in what memory areas can be shared or not. However, if the
system is dedicated to running multiple jobs within the same security
domain, its the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine
with 64GB main memory):
full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests.
This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test
the new prctl options and the other one is a fork test to verify that
the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (ksm) gets enabled for all vma's of a
process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been
enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call
- enable ksm for all vmas of a process (if the vmas support it).
- query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the
MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 08:13:40 +03:00
ksm_add_vma ( vma ) ;
2005-04-17 02:20:36 +04:00
out :
2010-05-18 18:30:49 +04:00
perf_event_mmap ( vma ) ;
2005-04-17 02:20:36 +04:00
mm - > total_vm + = len > > PAGE_SHIFT ;
2016-01-15 02:22:07 +03:00
mm - > data_vm + = len > > PAGE_SHIFT ;
2013-02-23 04:32:40 +04:00
if ( flags & VM_LOCKED )
mm - > locked_vm + = ( len > > PAGE_SHIFT ) ;
2023-01-26 22:37:49 +03:00
vm_flags_set ( vma , VM_SOFTDIRTY ) ;
2016-05-28 01:57:31 +03:00
return 0 ;
2022-09-06 22:48:45 +03:00
2022-09-06 22:48:50 +03:00
mas_store_fail :
2022-09-06 22:48:45 +03:00
vm_area_free ( vma ) ;
2022-12-02 07:53:39 +03:00
unacct_fail :
2022-09-06 22:48:50 +03:00
vm_unacct_memory ( len > > PAGE_SHIFT ) ;
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
}
2018-07-14 02:59:20 +03:00
int vm_brk_flags ( unsigned long addr , unsigned long request , unsigned long flags )
2012-04-21 02:35:40 +04:00
{
struct mm_struct * mm = current - > mm ;
2022-09-06 22:48:50 +03:00
struct vm_area_struct * vma = NULL ;
2018-07-14 02:59:20 +03:00
unsigned long len ;
2016-05-28 01:57:31 +03:00
int ret ;
2013-02-23 04:32:40 +04:00
bool populate ;
2017-02-25 01:58:22 +03:00
LIST_HEAD ( uf ) ;
2023-01-20 19:26:09 +03:00
VMA_ITERATOR ( vmi , mm , addr ) ;
2012-04-21 02:35:40 +04:00
2018-07-14 02:59:20 +03:00
len = PAGE_ALIGN ( request ) ;
if ( len < request )
return - ENOMEM ;
if ( ! len )
return 0 ;
2020-06-09 07:33:25 +03:00
if ( mmap_write_lock_killable ( mm ) )
2016-05-24 02:25:42 +03:00
return - EINTR ;
2022-09-06 22:48:50 +03:00
/* Until we need other flags, refuse anything except VM_EXEC. */
if ( ( flags & ( ~ VM_EXEC ) ) ! = 0 )
return - EINVAL ;
ret = check_brk_limits ( addr , len ) ;
if ( ret )
goto limits_failed ;
2023-01-20 19:26:13 +03:00
ret = do_vmi_munmap ( & vmi , mm , addr , len , & uf , 0 ) ;
2022-09-06 22:48:50 +03:00
if ( ret )
goto munmap_failed ;
2023-01-20 19:26:09 +03:00
vma = vma_prev ( & vmi ) ;
ret = do_brk_flags ( & vmi , vma , addr , len , flags ) ;
2013-02-23 04:32:40 +04:00
populate = ( ( mm - > def_flags & VM_LOCKED ) ! = 0 ) ;
2020-06-09 07:33:25 +03:00
mmap_write_unlock ( mm ) ;
2017-02-25 01:58:22 +03:00
userfaultfd_unmap_complete ( mm , & uf ) ;
2016-05-28 01:57:31 +03:00
if ( populate & & ! ret )
2013-02-23 04:32:40 +04:00
mm_populate ( addr , len ) ;
2012-04-21 02:35:40 +04:00
return ret ;
2022-09-06 22:48:50 +03:00
munmap_failed :
limits_failed :
mmap_write_unlock ( mm ) ;
return ret ;
2012-04-21 02:35:40 +04:00
}
powerpc: do not make the entire heap executable
On 32-bit powerpc the ELF PLT sections of binaries (built with
--bss-plt, or with a toolchain which defaults to it) look like this:
[17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4
[18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4
[19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4
Which results in an ELF load header:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000
This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping
ends at 0002aff8, so the PLT falls in the 0 fill section described by
the load header, and after a page boundary.
Unfortunately the generic ELF loader ignores the X bit in the load
headers when it creates the 0 filled non-file backed mappings. It
assumes all of these mappings are RW BSS sections, which is not the case
for PPC.
gcc/ld has an option (--secure-plt) to not do this, this is said to
incur a small performance penalty.
Currently, to support 32-bit binaries with PLT in BSS kernel maps
*entire brk area* with executable rights for all binaries, even
--secure-plt ones.
Stop doing that.
Teach the ELF loader to check the X bit in the relevant load header and
create 0 filled anonymous mappings that are executable if the load
header requests that.
Test program showing the difference in /proc/$PID/maps:
int main() {
char buf[16*1024];
char *p = malloc(123); /* make "[heap]" mapping appear */
int fd = open("/proc/self/maps", O_RDONLY);
int len = read(fd, buf, sizeof(buf));
write(1, buf, len);
printf("%p\n", p);
return 0;
}
Compiled using: gcc -mbss-plt -m32 -Os test.c -otest
Unpatched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10690000-106c0000 rwxp 00000000 00:00 0 [heap]
f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
0x10690008
Patched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10180000-101b0000 rw-p 00000000 00:00 0 [heap]
^^^^ this has changed
f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
0x10180008
The patch was originally posted in 2012 by Jason Gunthorpe
and apparently ignored:
https://lkml.org/lkml/2012/9/30/138
Lightly run-tested.
Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-23 02:45:16 +03:00
EXPORT_SYMBOL ( vm_brk_flags ) ;
int vm_brk ( unsigned long addr , unsigned long len )
{
return vm_brk_flags ( addr , len , 0 ) ;
}
2012-04-21 02:35:40 +04:00
EXPORT_SYMBOL ( vm_brk ) ;
2005-04-17 02:20:36 +04:00
/* Release all mmaps. */
void exit_mmap ( struct mm_struct * mm )
{
2011-05-25 04:11:45 +04:00
struct mmu_gather tlb ;
2008-10-19 07:26:50 +04:00
struct vm_area_struct * vma ;
2005-04-17 02:20:36 +04:00
unsigned long nr_accounted = 0 ;
2022-09-06 22:49:06 +03:00
MA_STATE ( mas , & mm - > mm_mt , 0 , 0 ) ;
int count = 0 ;
2005-04-17 02:20:36 +04:00
2007-05-02 21:27:14 +04:00
/* mm's last user has gone, and its about to be pulled down */
mmu-notifiers: core
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incraese in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows to compile a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled. Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 02:46:29 +04:00
mmu_notifier_release ( mm ) ;
2007-05-02 21:27:14 +04:00
2022-06-01 01:30:59 +03:00
mmap_read_lock ( mm ) ;
mm: rearrange exit_mmap() to unlock before arch_exit_mmap
Christophe Saout reported [in precursor to:
http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:
> Note that I also some a different issue with CONFIG_UNEVICTABLE_LRU.
> Seems like Xen tears down current->mm early on process termination, so
> that __get_user_pages in exit_mmap causes nasty messages when the
> process had any mlocked pages. (in fact, it somehow manages to get into
> the swapping code and produces a null pointer dereference trying to get
> a swap token)
Jeremy explained:
Yes. In the normal case under Xen, an in-use pagetable is "pinned",
meaning that it is RO to the kernel, and all updates must go via hypercall
(or writes are trapped and emulated, which is much the same thing). An
unpinned pagetable is not currently in use by any process, and can be
directly accessed as normal RW pages.
As an optimisation at process exit time, we unpin the pagetable as early
as possible (switching the process to init_mm), so that all the normal
pagetable teardown can happen with direct memory accesses.
This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
a few lines below. The obvious thing to do would be to move
arch_exit_mmap() to below the munlock code, but I think we'd want to
call it even if mm->mmap is NULL, just to be on the safe side.
Thus, this patch:
exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
as the latter may switch the current mm to init_mm, which would cause the
former to fail.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Keir Fraser <keir.fraser@eu.citrix.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Alex Williamson <alex.williamson@hp.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-12 00:04:41 +03:00
arch_exit_mmap ( mm ) ;
2022-09-06 22:49:06 +03:00
vma = mas_find ( & mas , ULONG_MAX ) ;
2022-01-15 01:06:14 +03:00
if ( ! vma ) {
/* Can happen if dup_mmap() received an OOM */
2022-06-01 01:30:59 +03:00
mmap_read_unlock ( mm ) ;
mm: rearrange exit_mmap() to unlock before arch_exit_mmap
Christophe Saout reported [in precursor to:
http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:
> Note that I also some a different issue with CONFIG_UNEVICTABLE_LRU.
> Seems like Xen tears down current->mm early on process termination, so
> that __get_user_pages in exit_mmap causes nasty messages when the
> process had any mlocked pages. (in fact, it somehow manages to get into
> the swapping code and produces a null pointer dereference trying to get
> a swap token)
Jeremy explained:
Yes. In the normal case under Xen, an in-use pagetable is "pinned",
meaning that it is RO to the kernel, and all updates must go via hypercall
(or writes are trapped and emulated, which is much the same thing). An
unpinned pagetable is not currently in use by any process, and can be
directly accessed as normal RW pages.
As an optimisation at process exit time, we unpin the pagetable as early
as possible (switching the process to init_mm), so that all the normal
pagetable teardown can happen with direct memory accesses.
This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
a few lines below. The obvious thing to do would be to move
arch_exit_mmap() to below the munlock code, but I think we'd want to
call it even if mm->mmap is NULL, just to be on the safe side.
Thus, this patch:
exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
as the latter may switch the current mm to init_mm, which would cause the
former to fail.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Keir Fraser <keir.fraser@eu.citrix.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Alex Williamson <alex.williamson@hp.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-12 00:04:41 +03:00
return ;
2022-01-15 01:06:14 +03:00
}
mm: rearrange exit_mmap() to unlock before arch_exit_mmap
Christophe Saout reported [in precursor to:
http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:
> Note that I also some a different issue with CONFIG_UNEVICTABLE_LRU.
> Seems like Xen tears down current->mm early on process termination, so
> that __get_user_pages in exit_mmap causes nasty messages when the
> process had any mlocked pages. (in fact, it somehow manages to get into
> the swapping code and produces a null pointer dereference trying to get
> a swap token)
Jeremy explained:
Yes. In the normal case under Xen, an in-use pagetable is "pinned",
meaning that it is RO to the kernel, and all updates must go via hypercall
(or writes are trapped and emulated, which is much the same thing). An
unpinned pagetable is not currently in use by any process, and can be
directly accessed as normal RW pages.
As an optimisation at process exit time, we unpin the pagetable as early
as possible (switching the process to init_mm), so that all the normal
pagetable teardown can happen with direct memory accesses.
This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
a few lines below. The obvious thing to do would be to move
arch_exit_mmap() to below the munlock code, but I think we'd want to
call it even if mm->mmap is NULL, just to be on the safe side.
Thus, this patch:
exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
as the latter may switch the current mm to init_mm, which would cause the
former to fail.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Keir Fraser <keir.fraser@eu.citrix.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Alex Williamson <alex.williamson@hp.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-12 00:04:41 +03:00
2005-04-17 02:20:36 +04:00
lru_add_drain ( ) ;
flush_cache_mm ( mm ) ;
2021-01-28 02:53:44 +03:00
tlb_gather_mmu_fullmm ( & tlb , mm ) ;
2009-01-07 01:40:29 +03:00
/* update_hiwater_rss(mm) here? but nobody should be looking */
2022-09-06 22:49:06 +03:00
/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
2023-01-26 22:37:51 +03:00
unmap_vmas ( & tlb , & mm - > mm_mt , vma , 0 , ULONG_MAX , false ) ;
2022-06-01 01:30:59 +03:00
mmap_read_unlock ( mm ) ;
/*
* Set MMF_OOM_SKIP to hide this task from the oom killer / reaper
2022-06-01 01:31:00 +03:00
* because the memory has been already freed .
2022-06-01 01:30:59 +03:00
*/
set_bit ( MMF_OOM_SKIP , & mm - > flags ) ;
mmap_write_lock ( mm ) ;
2023-02-27 20:36:07 +03:00
mt_clear_in_rcu ( & mm - > mm_mt ) ;
2022-09-06 22:49:06 +03:00
free_pgtables ( & tlb , & mm - > mm_mt , vma , FIRST_USER_ADDRESS ,
2023-02-27 20:36:18 +03:00
USER_PGTABLES_CEILING , true ) ;
2021-01-28 02:53:43 +03:00
tlb_finish_mmu ( & tlb ) ;
2005-04-17 02:20:36 +04:00
2022-09-06 22:49:06 +03:00
/*
* Walk the list again , actually closing and freeing it , with preemption
* enabled , without holding any MM locks besides the unreachable
* mmap_write_lock .
*/
do {
2012-05-07 00:54:06 +04:00
if ( vma - > vm_flags & VM_ACCOUNT )
nr_accounted + = vma_pages ( vma ) ;
2023-02-27 20:36:31 +03:00
remove_vma ( vma , true ) ;
2022-09-06 22:49:06 +03:00
count + + ;
2020-04-17 02:46:10 +03:00
cond_resched ( ) ;
2022-09-06 22:49:06 +03:00
} while ( ( vma = mas_find ( & mas , ULONG_MAX ) ) ! = NULL ) ;
BUG_ON ( count ! = mm - > map_count ) ;
2022-09-06 22:48:45 +03:00
trace_exit_mmap ( mm ) ;
__mt_destroy ( & mm - > mm_mt ) ;
2022-01-15 01:06:14 +03:00
mmap_write_unlock ( mm ) ;
2012-05-07 00:54:06 +04:00
vm_unacct_memory ( nr_accounted ) ;
2005-04-17 02:20:36 +04:00
}
/* Insert vm structure into process list sorted by address
* and into the inode ' s i_mmap tree . If vm_file is non - NULL
2014-12-13 03:54:24 +03:00
* then i_mmap_rwsem is taken here .
2005-04-17 02:20:36 +04:00
*/
2012-10-09 03:29:07 +04:00
int insert_vm_struct ( struct mm_struct * mm , struct vm_area_struct * vma )
2005-04-17 02:20:36 +04:00
{
2022-09-06 22:48:45 +03:00
unsigned long charged = vma_pages ( vma ) ;
2005-04-17 02:20:36 +04:00
2022-09-06 22:48:45 +03:00
2022-09-06 22:49:06 +03:00
if ( find_vma_intersection ( mm , vma - > vm_start , vma - > vm_end ) )
2015-09-09 01:04:08 +03:00
return - ENOMEM ;
2022-09-06 22:48:45 +03:00
2015-09-09 01:04:08 +03:00
if ( ( vma - > vm_flags & VM_ACCOUNT ) & &
2022-09-06 22:48:45 +03:00
security_vm_enough_memory_mm ( mm , charged ) )
2015-09-09 01:04:08 +03:00
return - ENOMEM ;
2005-04-17 02:20:36 +04:00
/*
* The vm_pgoff of a purely anonymous vma should be irrelevant
* until its first write fault , when page ' s anon_vma and index
* are set . But now set the vm_pgoff it will almost certainly
* end up with ( unless mremap moves it elsewhere before that
* first wfault ) , so / proc / pid / maps tells a consistent story .
*
* By setting it to reflect the virtual start address of the
* vma , merges and splits can happen in a seamless way , just
* using the existing file pgoff checks and manipulations .
2020-10-14 02:54:18 +03:00
* Similarly in do_mmap and in do_brk_flags .
2005-04-17 02:20:36 +04:00
*/
mmap: fix the usage of ->vm_pgoff in special_mapping paths
Test-case:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <assert.h>
void *find_vdso_vaddr(void)
{
FILE *perl;
char buf[32] = {};
perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
fread(buf, sizeof(buf), 1, perl);
fclose(perl);
return (void *)atol(buf);
}
#define PAGE_SIZE 4096
int main(void)
{
void *vdso = find_vdso_vaddr();
assert(vdso);
// of course they should differ, and they do so far
printf("vdso pages differ: %d\n",
!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
// split into 2 vma's
assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);
// force another fault on the next check
assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
// now they no longer differ, the 2nd vm_pgoff is wrong
printf("vdso pages differ: %d\n",
!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
return 0;
}
Output:
vdso pages differ: 1
vdso pages differ: 0
This is because split_vma() correctly updates ->vm_pgoff, but the logic
in insert_vm_struct() and special_mapping_fault() is absolutely broken,
so the fault at vdso + PAGE_SIZE return the 1st page. The same happens
if you simply unmap the 1st page.
special_mapping_fault() does:
pgoff = vmf->pgoff - vma->vm_pgoff;
and this is _only_ correct if vma->vm_start mmaps the first page from
->vm_private_data array.
vdso or any other user of install_special_mapping() is not anonymous,
it has the "backing storage" even if it is just the array of pages.
So we actually need to make vm_pgoff work as an offset in this array.
Note: this also allows to fix another problem: currently gdb can't access
"[vvar]" memory because in this case special_mapping_fault() doesn't work.
Now that we can use ->vm_pgoff we can implement ->access() and fix this.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 00:58:31 +03:00
if ( vma_is_anonymous ( vma ) ) {
2005-04-17 02:20:36 +04:00
BUG_ON ( vma - > anon_vma ) ;
vma - > vm_pgoff = vma - > vm_start > > PAGE_SHIFT ;
}
uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints
Add uprobes support to the core kernel, with x86 support.
This commit adds the kernel facilities, the actual uprobes
user-space ABI and perf probe support comes in later commits.
General design:
Uprobes are maintained in an rb-tree indexed by inode and offset
(the offset here is from the start of the mapping). For a unique
(inode, offset) tuple, there can be at most one uprobe in the
rb-tree.
Since the (inode, offset) tuple identifies a unique uprobe, more
than one user may be interested in the same uprobe. This provides
the ability to connect multiple 'consumers' to the same uprobe.
Each consumer defines a handler and a filter (optional). The
'handler' is run every time the uprobe is hit, if it matches the
'filter' criteria.
The first consumer of a uprobe causes the breakpoint to be
inserted at the specified address and subsequent consumers are
appended to this list. On subsequent probes, the consumer gets
appended to the existing list of consumers. The breakpoint is
removed when the last consumer unregisters. For all other
unregisterations, the consumer is removed from the list of
consumers.
Given a inode, we get a list of the mms that have mapped the
inode. Do the actual registration if mm maps the page where a
probe needs to be inserted/removed.
We use a temporary list to walk through the vmas that map the
inode.
- The number of maps that map the inode, is not known before we
walk the rmap and keeps changing.
- extending vm_area_struct wasn't recommended, it's a
size-critical data structure.
- There can be more than one maps of the inode in the same mm.
We add callbacks to the mmap methods to keep an eye on text vmas
that are of interest to uprobes. When a vma of interest is mapped,
we insert the breakpoint at the right address.
Uprobe works by replacing the instruction at the address defined
by (inode, offset) with the arch specific breakpoint
instruction. We save a copy of the original instruction at the
uprobed address.
This is needed for:
a. executing the instruction out-of-line (xol).
b. instruction analysis for any subsequent fixups.
c. restoring the instruction back when the uprobe is unregistered.
We insert or delete a breakpoint instruction, and this
breakpoint instruction is assumed to be the smallest instruction
available on the platform. For fixed size instruction platforms
this is trivially true, for variable size instruction platforms
the breakpoint instruction is typically the smallest (often a
single byte).
Writing the instruction is done by COWing the page and changing
the instruction during the copy, this even though most platforms
allow atomic writes of the breakpoint instruction. This also
mirrors the behaviour of a ptrace() memory write to a PRIVATE
file map.
The core worker is derived from KSM's replace_page() logic.
In essence, similar to KSM:
a. allocate a new page and copy over contents of the page that
has the uprobed vaddr
b. modify the copy and insert the breakpoint at the required
address
c. switch the original page with the copy containing the
breakpoint
d. flush page tables.
replace_page() is being replicated here because of some minor
changes in the type of pages and also because Hugh Dickins had
plans to improve replace_page() for KSM specific work.
Instruction analysis on x86 is based on instruction decoder and
determines if an instruction can be probed and determines the
necessary fixups after singlestep. Instruction analysis is done
at probe insertion time so that we avoid having to repeat the
same analysis every time a probe is hit.
A lot of code here is due to the improvement/suggestions/inputs
from Peter Zijlstra.
Changelog:
(v10):
- Add code to clear REX.B prefix as suggested by Denys Vlasenko
and Masami Hiramatsu.
(v9):
- Use insn_offset_modrm as suggested by Masami Hiramatsu.
(v7):
Handle comments from Peter Zijlstra:
- Dont take reference to inode. (expect inode to uprobe_register to be sane).
- Use PTR_ERR to set the return value.
- No need to take reference to inode.
- use PTR_ERR to return error value.
- register and uprobe_unregister share code.
(v5):
- Modified del_consumer as per comments from Peter.
- Drop reference to inode before dropping reference to uprobe.
- Use i_size_read(inode) instead of inode->i_size.
- Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
- Includes errno.h as recommended by Stephen Rothwell to fix a build issue
on sparc defconfig
- Remove restrictions while unregistering.
- Earlier code leaked inode references under some conditions while
registering/unregistering.
- Continue the vma-rmap walk even if the intermediate vma doesnt
meet the requirements.
- Validate the vma found by find_vma before inserting/removing the
breakpoint
- Call del_consumer under mutex_lock.
- Use hash locks.
- Handle mremap.
- Introduce find_least_offset_node() instead of close match logic in
find_uprobe
- Uprobes no more depends on MM_OWNER; No reference to task_structs
while inserting/removing a probe.
- Uses read_mapping_page instead of grab_cache_page so that the pages
have valid content.
- pass NULL to get_user_pages for the task parameter.
- call SetPageUptodate on the new page allocated in write_opcode.
- fix leaking a reference to the new page under certain conditions.
- Include Instruction Decoder if Uprobes gets defined.
- Remove const attributes for instruction prefix arrays.
- Uses mm_context to know if the application is 32 bit.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Also-written-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
[ Made various small edits to the commit log ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-09 13:26:42 +04:00
2022-09-06 22:49:06 +03:00
if ( vma_link ( mm , vma ) ) {
2022-09-06 22:48:45 +03:00
vm_unacct_memory ( charged ) ;
return - ENOMEM ;
}
2005-04-17 02:20:36 +04:00
return 0 ;
}
/*
* Copy the vma structure to a new location in the same mm ,
* prior to moving page table entries , to effect an mremap move .
*/
struct vm_area_struct * copy_vma ( struct vm_area_struct * * vmap ,
2012-10-09 03:31:50 +04:00
unsigned long addr , unsigned long len , pgoff_t pgoff ,
bool * need_rmap_locks )
2005-04-17 02:20:36 +04:00
{
struct vm_area_struct * vma = * vmap ;
unsigned long vma_start = vma - > vm_start ;
struct mm_struct * mm = vma - > vm_mm ;
struct vm_area_struct * new_vma , * prev ;
mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.
If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
This program exercises the anon_vma_moveto_tail:
===
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");
return 0;
}
===
$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
You can now use it on all perf tools, such as:
perf record -e probe:anon_vma_moveto_tail -aR sleep 1
$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:08:05 +04:00
bool faulted_in_anon_vma = true ;
2023-01-20 19:26:26 +03:00
VMA_ITERATOR ( vmi , mm , addr ) ;
2005-04-17 02:20:36 +04:00
/*
* If anonymous vma has not yet been faulted , update new pgoff
* to match new location , to increase its chance of merging .
*/
mremap: fix the wrong !vma->vm_file check in copy_vma()
Test-case:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <assert.h>
void *find_vdso_vaddr(void)
{
FILE *perl;
char buf[32] = {};
perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
fread(buf, sizeof(buf), 1, perl);
fclose(perl);
return (void *)atol(buf);
}
#define PAGE_SIZE 4096
void *get_unmapped_area(void)
{
void *p = mmap(0, PAGE_SIZE, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS, -1,0);
assert(p != MAP_FAILED);
munmap(p, PAGE_SIZE);
return p;
}
char save[2][PAGE_SIZE];
int main(void)
{
void *vdso = find_vdso_vaddr();
void *page[2];
assert(vdso);
memcpy(save, vdso, sizeof (save));
// force another fault on the next check
assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
page[0] = mremap(vdso,
PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
get_unmapped_area());
page[1] = mremap(vdso + PAGE_SIZE,
PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
get_unmapped_area());
assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED);
printf("match: %d %d\n",
!memcmp(save[0], page[0], PAGE_SIZE),
!memcmp(save[1], page[1], PAGE_SIZE));
return 0;
}
fails without this patch. Before the previous commit it gets the wrong
page, now it segfaults (which is imho better).
This is because copy_vma() wrongly assumes that if vma->vm_file == NULL
is irrelevant until the first fault which will use do_anonymous_page().
This is obviously wrong for the special mapping.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 00:58:34 +03:00
if ( unlikely ( vma_is_anonymous ( vma ) & & ! vma - > anon_vma ) ) {
2005-04-17 02:20:36 +04:00
pgoff = addr > > PAGE_SHIFT ;
mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.
If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
This program exercises the anon_vma_moveto_tail:
===
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");
return 0;
}
===
$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
You can now use it on all perf tools, such as:
perf record -e probe:anon_vma_moveto_tail -aR sleep 1
$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:08:05 +04:00
faulted_in_anon_vma = false ;
}
2005-04-17 02:20:36 +04:00
2022-09-06 22:49:06 +03:00
new_vma = find_vma_prev ( mm , addr , & prev ) ;
if ( new_vma & & new_vma - > vm_start < addr + len )
2012-10-09 03:29:07 +04:00
return NULL ; /* should never get here */
2022-09-06 22:48:48 +03:00
2023-01-20 19:26:30 +03:00
new_vma = vma_merge ( & vmi , mm , prev , addr , addr + len , vma - > vm_flags ,
2015-09-05 01:46:24 +03:00
vma - > anon_vma , vma - > vm_file , pgoff , vma_policy ( vma ) ,
2022-03-05 07:28:51 +03:00
vma - > vm_userfaultfd_ctx , anon_vma_name ( vma ) ) ;
2005-04-17 02:20:36 +04:00
if ( new_vma ) {
/*
* Source vma may have been merged into new_vma
*/
mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.
If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
This program exercises the anon_vma_moveto_tail:
===
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");
return 0;
}
===
$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
You can now use it on all perf tools, such as:
perf record -e probe:anon_vma_moveto_tail -aR sleep 1
$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:08:05 +04:00
if ( unlikely ( vma_start > = new_vma - > vm_start & &
vma_start < new_vma - > vm_end ) ) {
/*
* The only way we can get a vma_merge with
* self during an mremap is if the vma hasn ' t
* been faulted in yet and we were allowed to
* reset the dst vma - > vm_pgoff to the
* destination address of the mremap to allow
* the merge to happen . mremap must change the
* vm_pgoff linearity between src and dst vmas
* ( in turn preventing a vma_merge ) to be
* safe . It is only safe to keep the vm_pgoff
* linear if there are no pages mapped yet .
*/
2014-10-10 02:28:10 +04:00
VM_BUG_ON_VMA ( faulted_in_anon_vma , new_vma ) ;
2012-10-09 03:31:50 +04:00
* vmap = vma = new_vma ;
2012-10-09 03:31:36 +04:00
}
2012-10-09 03:31:50 +04:00
* need_rmap_locks = ( new_vma - > vm_pgoff < = vma - > vm_pgoff ) ;
2005-04-17 02:20:36 +04:00
} else {
2018-07-21 23:48:51 +03:00
new_vma = vm_area_dup ( vma ) ;
2015-09-09 01:03:38 +03:00
if ( ! new_vma )
goto out ;
new_vma - > vm_start = addr ;
new_vma - > vm_end = addr + len ;
new_vma - > vm_pgoff = pgoff ;
if ( vma_dup_policy ( vma , new_vma ) )
goto out_free_vma ;
if ( anon_vma_clone ( new_vma , vma ) )
goto out_free_mempol ;
if ( new_vma - > vm_file )
get_file ( new_vma - > vm_file ) ;
if ( new_vma - > vm_ops & & new_vma - > vm_ops - > open )
new_vma - > vm_ops - > open ( new_vma ) ;
2023-02-27 20:36:16 +03:00
vma_start_write ( new_vma ) ;
2022-09-06 22:49:06 +03:00
if ( vma_link ( mm , new_vma ) )
2022-09-06 22:48:48 +03:00
goto out_vma_link ;
2015-09-09 01:03:38 +03:00
* need_rmap_locks = false ;
2005-04-17 02:20:36 +04:00
}
return new_vma ;
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
2022-09-06 22:48:48 +03:00
out_vma_link :
if ( new_vma - > vm_ops & & new_vma - > vm_ops - > close )
new_vma - > vm_ops - > close ( new_vma ) ;
2022-10-11 23:36:51 +03:00
if ( new_vma - > vm_file )
fput ( new_vma - > vm_file ) ;
unlink_anon_vmas ( new_vma ) ;
2015-09-09 01:03:38 +03:00
out_free_mempol :
2013-09-12 01:20:14 +04:00
mpol_put ( vma_policy ( new_vma ) ) ;
2015-09-09 01:03:38 +03:00
out_free_vma :
2018-07-21 23:48:51 +03:00
vm_area_free ( new_vma ) ;
2015-09-09 01:03:38 +03:00
out :
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
return NULL ;
2005-04-17 02:20:36 +04:00
}
2005-05-01 19:58:35 +04:00
/*
* Return true if the calling process may expand its vm space by the passed
* number of pages
*/
2016-01-15 02:22:07 +03:00
bool may_expand_vm ( struct mm_struct * mm , vm_flags_t flags , unsigned long npages )
2005-05-01 19:58:35 +04:00
{
2016-01-15 02:22:07 +03:00
if ( mm - > total_vm + npages > rlimit ( RLIMIT_AS ) > > PAGE_SHIFT )
return false ;
2005-05-01 19:58:35 +04:00
2016-02-03 03:57:43 +03:00
if ( is_data_mapping ( flags ) & &
mm - > data_vm + npages > rlimit ( RLIMIT_DATA ) > > PAGE_SHIFT ) {
2016-05-21 02:57:45 +03:00
/* Workaround for Valgrind */
if ( rlimit ( RLIMIT_DATA ) = = 0 & &
mm - > data_vm + npages < = rlimit_max ( RLIMIT_DATA ) > > PAGE_SHIFT )
return true ;
2018-04-06 02:22:05 +03:00
pr_warn_once ( " %s (%d): VmData %lu exceed data ulimit %lu. Update limits%s. \n " ,
current - > comm , current - > pid ,
( mm - > data_vm + npages ) < < PAGE_SHIFT ,
rlimit ( RLIMIT_DATA ) ,
ignore_rlimit_data ? " " : " or use boot option ignore_rlimit_data " ) ;
if ( ! ignore_rlimit_data )
2016-02-03 03:57:43 +03:00
return false ;
}
2005-05-01 19:58:35 +04:00
2016-01-15 02:22:07 +03:00
return true ;
}
void vm_stat_account ( struct mm_struct * mm , vm_flags_t flags , long npages )
{
2021-11-05 23:38:12 +03:00
WRITE_ONCE ( mm - > total_vm , READ_ONCE ( mm - > total_vm ) + npages ) ;
2016-01-15 02:22:07 +03:00
2016-02-03 03:57:43 +03:00
if ( is_exec_mapping ( flags ) )
2016-01-15 02:22:07 +03:00
mm - > exec_vm + = npages ;
2016-02-03 03:57:43 +03:00
else if ( is_stack_mapping ( flags ) )
2016-01-15 02:22:07 +03:00
mm - > stack_vm + = npages ;
2016-02-03 03:57:43 +03:00
else if ( is_data_mapping ( flags ) )
2016-01-15 02:22:07 +03:00
mm - > data_vm + = npages ;
2005-05-01 19:58:35 +04:00
}
2007-02-09 01:20:41 +03:00
2018-06-08 03:08:04 +03:00
static vm_fault_t special_mapping_fault ( struct vm_fault * vmf ) ;
2014-05-20 02:58:33 +04:00
/*
* Having a close hook prevents vma merging regardless of flags .
*/
static void special_mapping_close ( struct vm_area_struct * vma )
{
}
static const char * special_mapping_name ( struct vm_area_struct * vma )
{
return ( ( struct vm_special_mapping * ) vma - > vm_private_data ) - > name ;
}
2021-04-30 08:57:48 +03:00
static int special_mapping_mremap ( struct vm_area_struct * new_vma )
2016-06-28 14:35:38 +03:00
{
struct vm_special_mapping * sm = new_vma - > vm_private_data ;
2017-06-19 19:32:42 +03:00
if ( WARN_ON_ONCE ( current - > mm ! = new_vma - > vm_mm ) )
return - EFAULT ;
2016-06-28 14:35:38 +03:00
if ( sm - > mremap )
return sm - > mremap ( sm , new_vma ) ;
2017-06-19 19:32:42 +03:00
2016-06-28 14:35:38 +03:00
return 0 ;
}
2020-12-15 06:08:25 +03:00
static int special_mapping_split ( struct vm_area_struct * vma , unsigned long addr )
{
/*
* Forbid splitting special mappings - kernel has expectations over
* the number of pages in mapping . Together with VM_DONTEXPAND
* the size of vma should stay the same over the special mapping ' s
* lifetime .
*/
return - EINVAL ;
}
2014-05-20 02:58:33 +04:00
static const struct vm_operations_struct special_mapping_vmops = {
. close = special_mapping_close ,
. fault = special_mapping_fault ,
2016-06-28 14:35:38 +03:00
. mremap = special_mapping_mremap ,
2014-05-20 02:58:33 +04:00
. name = special_mapping_name ,
2019-11-12 04:27:13 +03:00
/* vDSO code relies that VVAR can't be accessed remotely */
. access = NULL ,
2020-12-15 06:08:25 +03:00
. may_split = special_mapping_split ,
2014-05-20 02:58:33 +04:00
} ;
static const struct vm_operations_struct legacy_special_mapping_vmops = {
. close = special_mapping_close ,
. fault = special_mapping_fault ,
} ;
2007-02-09 01:20:41 +03:00
2018-06-08 03:08:04 +03:00
static vm_fault_t special_mapping_fault ( struct vm_fault * vmf )
2007-02-09 01:20:41 +03:00
{
2017-02-25 01:56:41 +03:00
struct vm_area_struct * vma = vmf - > vma ;
2008-02-09 03:15:19 +03:00
pgoff_t pgoff ;
2007-02-09 01:20:41 +03:00
struct page * * pages ;
2015-12-30 07:12:19 +03:00
if ( vma - > vm_ops = = & legacy_special_mapping_vmops ) {
2014-05-20 02:58:33 +04:00
pages = vma - > vm_private_data ;
2015-12-30 07:12:19 +03:00
} else {
struct vm_special_mapping * sm = vma - > vm_private_data ;
if ( sm - > fault )
2017-02-25 01:56:41 +03:00
return sm - > fault ( sm , vmf - > vma , vmf ) ;
2015-12-30 07:12:19 +03:00
pages = sm - > pages ;
}
2014-05-20 02:58:33 +04:00
mmap: fix the usage of ->vm_pgoff in special_mapping paths
Test-case:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <assert.h>
void *find_vdso_vaddr(void)
{
FILE *perl;
char buf[32] = {};
perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
fread(buf, sizeof(buf), 1, perl);
fclose(perl);
return (void *)atol(buf);
}
#define PAGE_SIZE 4096
int main(void)
{
void *vdso = find_vdso_vaddr();
assert(vdso);
// of course they should differ, and they do so far
printf("vdso pages differ: %d\n",
!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
// split into 2 vma's
assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);
// force another fault on the next check
assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
// now they no longer differ, the 2nd vm_pgoff is wrong
printf("vdso pages differ: %d\n",
!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
return 0;
}
Output:
vdso pages differ: 1
vdso pages differ: 0
This is because split_vma() correctly updates ->vm_pgoff, but the logic
in insert_vm_struct() and special_mapping_fault() is absolutely broken,
so the fault at vdso + PAGE_SIZE return the 1st page. The same happens
if you simply unmap the 1st page.
special_mapping_fault() does:
pgoff = vmf->pgoff - vma->vm_pgoff;
and this is _only_ correct if vma->vm_start mmaps the first page from
->vm_private_data array.
vdso or any other user of install_special_mapping() is not anonymous,
it has the "backing storage" even if it is just the array of pages.
So we actually need to make vm_pgoff work as an offset in this array.
Note: this also allows to fix another problem: currently gdb can't access
"[vvar]" memory because in this case special_mapping_fault() doesn't work.
Now that we can use ->vm_pgoff we can implement ->access() and fix this.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 00:58:31 +03:00
for ( pgoff = vmf - > pgoff ; pgoff & & * pages ; + + pages )
2008-02-09 03:15:19 +03:00
pgoff - - ;
2007-02-09 01:20:41 +03:00
if ( * pages ) {
struct page * page = * pages ;
get_page ( page ) ;
2008-02-09 03:15:19 +03:00
vmf - > page = page ;
return 0 ;
2007-02-09 01:20:41 +03:00
}
2008-02-09 03:15:19 +03:00
return VM_FAULT_SIGBUS ;
2007-02-09 01:20:41 +03:00
}
2014-05-20 02:58:33 +04:00
static struct vm_area_struct * __install_special_mapping (
struct mm_struct * mm ,
unsigned long addr , unsigned long len ,
2015-11-06 05:48:41 +03:00
unsigned long vm_flags , void * priv ,
const struct vm_operations_struct * ops )
2007-02-09 01:20:41 +03:00
{
2010-12-09 17:29:42 +03:00
int ret ;
2007-02-09 01:20:41 +03:00
struct vm_area_struct * vma ;
2018-07-22 01:24:03 +03:00
vma = vm_area_alloc ( mm ) ;
2007-02-09 01:20:41 +03:00
if ( unlikely ( vma = = NULL ) )
2014-03-18 02:22:02 +04:00
return ERR_PTR ( - ENOMEM ) ;
2007-02-09 01:20:41 +03:00
vma - > vm_start = addr ;
vma - > vm_end = addr + len ;
2023-01-26 22:37:48 +03:00
vm_flags_init ( vma , ( vm_flags | mm - > def_flags |
VM_DONTEXPAND | VM_SOFTDIRTY ) & ~ VM_LOCKED_MASK ) ;
2007-10-19 10:39:15 +04:00
vma - > vm_page_prot = vm_get_page_prot ( vma - > vm_flags ) ;
2007-02-09 01:20:41 +03:00
2014-05-20 02:58:33 +04:00
vma - > vm_ops = ops ;
vma - > vm_private_data = priv ;
2007-02-09 01:20:41 +03:00
2010-12-09 17:29:42 +03:00
ret = insert_vm_struct ( mm , vma ) ;
if ( ret )
goto out ;
2007-02-09 01:20:41 +03:00
2016-01-15 02:22:07 +03:00
vm_stat_account ( mm , vma - > vm_flags , len > > PAGE_SHIFT ) ;
2007-02-09 01:20:41 +03:00
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 14:02:48 +04:00
perf_event_mmap ( vma ) ;
2009-06-05 16:04:55 +04:00
2014-03-18 02:22:02 +04:00
return vma ;
2010-12-09 17:29:42 +03:00
out :
2018-07-21 23:48:51 +03:00
vm_area_free ( vma ) ;
2014-03-18 02:22:02 +04:00
return ERR_PTR ( ret ) ;
}
2016-09-05 16:33:05 +03:00
bool vma_is_special_mapping ( const struct vm_area_struct * vma ,
const struct vm_special_mapping * sm )
{
return vma - > vm_private_data = = sm & &
( vma - > vm_ops = = & special_mapping_vmops | |
vma - > vm_ops = = & legacy_special_mapping_vmops ) ;
}
2014-05-20 02:58:33 +04:00
/*
2020-06-09 07:33:54 +03:00
* Called with mm - > mmap_lock held for writing .
2014-05-20 02:58:33 +04:00
* Insert a new vma covering the given region , with the given flags .
* Its pages are supplied by the given array of struct page * .
* The array can be shorter than len > > PAGE_SHIFT if it ' s null - terminated .
* The region past the last page supplied will always produce SIGBUS .
* The array pointer and the pages it points to are assumed to stay alive
* for as long as this mapping might exist .
*/
struct vm_area_struct * _install_special_mapping (
struct mm_struct * mm ,
unsigned long addr , unsigned long len ,
unsigned long vm_flags , const struct vm_special_mapping * spec )
{
2015-11-06 05:48:41 +03:00
return __install_special_mapping ( mm , addr , len , vm_flags , ( void * ) spec ,
& special_mapping_vmops ) ;
2014-05-20 02:58:33 +04:00
}
2014-03-18 02:22:02 +04:00
int install_special_mapping ( struct mm_struct * mm ,
unsigned long addr , unsigned long len ,
unsigned long vm_flags , struct page * * pages )
{
2014-05-20 02:58:33 +04:00
struct vm_area_struct * vma = __install_special_mapping (
2015-11-06 05:48:41 +03:00
mm , addr , len , vm_flags , ( void * ) pages ,
& legacy_special_mapping_vmops ) ;
2014-03-18 02:22:02 +04:00
2014-06-05 03:07:05 +04:00
return PTR_ERR_OR_ZERO ( vma ) ;
2007-02-09 01:20:41 +03:00
}
2008-07-29 02:46:26 +04:00
static DEFINE_MUTEX ( mm_all_locks_mutex ) ;
2008-08-11 11:30:25 +04:00
static void vm_lock_anon_vma ( struct mm_struct * mm , struct anon_vma * anon_vma )
2008-07-29 02:46:26 +04:00
{
2017-09-09 02:15:08 +03:00
if ( ! test_bit ( 0 , ( unsigned long * ) & anon_vma - > root - > rb_root . rb_root . rb_node ) ) {
2008-07-29 02:46:26 +04:00
/*
* The LSB of head . next can ' t change from under us
* because we hold the mm_all_locks_mutex .
*/
2020-06-09 07:33:47 +03:00
down_write_nest_lock ( & anon_vma - > root - > rwsem , & mm - > mmap_lock ) ;
2008-07-29 02:46:26 +04:00
/*
* We can safely modify head . next after taking the
2012-12-02 23:56:46 +04:00
* anon_vma - > root - > rwsem . If some other vma in this mm shares
2008-07-29 02:46:26 +04:00
* the same anon_vma we won ' t take it again .
*
* No need of atomic instructions here , head . next
* can ' t change from under us thanks to the
2012-12-02 23:56:46 +04:00
* anon_vma - > root - > rwsem .
2008-07-29 02:46:26 +04:00
*/
if ( __test_and_set_bit ( 0 , ( unsigned long * )
2017-09-09 02:15:08 +03:00
& anon_vma - > root - > rb_root . rb_root . rb_node ) )
2008-07-29 02:46:26 +04:00
BUG ( ) ;
}
}
2008-08-11 11:30:25 +04:00
static void vm_lock_mapping ( struct mm_struct * mm , struct address_space * mapping )
2008-07-29 02:46:26 +04:00
{
if ( ! test_bit ( AS_MM_ALL_LOCKS , & mapping - > flags ) ) {
/*
* AS_MM_ALL_LOCKS can ' t change from under us because
* we hold the mm_all_locks_mutex .
*
* Operations on - > flags have to be atomic because
* even if AS_MM_ALL_LOCKS is stable thanks to the
* mm_all_locks_mutex , there may be other cpus
* changing other bitflags in parallel to us .
*/
if ( test_and_set_bit ( AS_MM_ALL_LOCKS , & mapping - > flags ) )
BUG ( ) ;
2020-06-09 07:33:47 +03:00
down_write_nest_lock ( & mapping - > i_mmap_rwsem , & mm - > mmap_lock ) ;
2008-07-29 02:46:26 +04:00
}
}
/*
* This operation locks against the VM for all pte / vma / mm related
* operations that could ever happen on a certain mm . This includes
* vmtruncate , try_to_unmap , and all page faults .
*
2020-06-09 07:33:54 +03:00
* The caller must take the mmap_lock in write mode before calling
2008-07-29 02:46:26 +04:00
* mm_take_all_locks ( ) . The caller isn ' t allowed to release the
2020-06-09 07:33:54 +03:00
* mmap_lock until mm_drop_all_locks ( ) returns .
2008-07-29 02:46:26 +04:00
*
2020-06-09 07:33:54 +03:00
* mmap_lock in write mode is required in order to block all operations
2008-07-29 02:46:26 +04:00
* that could modify pagetables and free pages without need of
2015-02-11 01:09:59 +03:00
* altering the vma layout . It ' s also needed in write mode to avoid new
2008-07-29 02:46:26 +04:00
* anon_vmas to be associated with existing vmas .
*
* A single task can ' t take more than one mm_take_all_locks ( ) in a row
* or it would deadlock .
*
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
* The LSB in anon_vma - > rb_root . rb_node and the AS_MM_ALL_LOCKS bitflag in
2008-07-29 02:46:26 +04:00
* mapping - > flags avoid to take the same lock twice , if more than one
* vma in this mm is backed by the same anon_vma or address_space .
*
2016-01-16 03:57:31 +03:00
* We take locks in following order , accordingly to comment at beginning
* of mm / rmap . c :
* - all hugetlbfs_i_mmap_rwsem_key locks ( aka mapping - > i_mmap_rwsem for
* hugetlb mapping ) ;
2023-02-27 20:36:20 +03:00
* - all vmas marked locked
2016-01-16 03:57:31 +03:00
* - all i_mmap_rwsem locks ;
* - all anon_vma - > rwseml
*
* We can take all locks within these types randomly because the VM code
* doesn ' t nest them and we protected from parallel mm_take_all_locks ( ) by
* mm_all_locks_mutex .
2008-07-29 02:46:26 +04:00
*
* mm_take_all_locks ( ) and mm_drop_all_locks are expensive operations
* that may have to take thousand of locks .
*
* mm_take_all_locks ( ) can fail if it ' s interrupted by signals .
*/
int mm_take_all_locks ( struct mm_struct * mm )
{
struct vm_area_struct * vma ;
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
struct anon_vma_chain * avc ;
2022-09-06 22:49:06 +03:00
MA_STATE ( mas , & mm - > mm_mt , 0 , 0 ) ;
2008-07-29 02:46:26 +04:00
2022-04-29 09:16:11 +03:00
mmap_assert_write_locked ( mm ) ;
2008-07-29 02:46:26 +04:00
mutex_lock ( & mm_all_locks_mutex ) ;
2023-02-27 20:36:20 +03:00
mas_for_each ( & mas , vma , ULONG_MAX ) {
if ( signal_pending ( current ) )
goto out_unlock ;
vma_start_write ( vma ) ;
}
mas_set ( & mas , 0 ) ;
2022-09-06 22:49:06 +03:00
mas_for_each ( & mas , vma , ULONG_MAX ) {
2008-07-29 02:46:26 +04:00
if ( signal_pending ( current ) )
goto out_unlock ;
2016-01-16 03:57:31 +03:00
if ( vma - > vm_file & & vma - > vm_file - > f_mapping & &
is_vm_hugetlb_page ( vma ) )
vm_lock_mapping ( mm , vma - > vm_file - > f_mapping ) ;
}
2022-09-06 22:49:06 +03:00
mas_set ( & mas , 0 ) ;
mas_for_each ( & mas , vma , ULONG_MAX ) {
2016-01-16 03:57:31 +03:00
if ( signal_pending ( current ) )
goto out_unlock ;
if ( vma - > vm_file & & vma - > vm_file - > f_mapping & &
! is_vm_hugetlb_page ( vma ) )
2008-08-11 11:30:25 +04:00
vm_lock_mapping ( mm , vma - > vm_file - > f_mapping ) ;
2008-07-29 02:46:26 +04:00
}
2008-08-11 11:30:25 +04:00
2022-09-06 22:49:06 +03:00
mas_set ( & mas , 0 ) ;
mas_for_each ( & mas , vma , ULONG_MAX ) {
2008-08-11 11:30:25 +04:00
if ( signal_pending ( current ) )
goto out_unlock ;
if ( vma - > anon_vma )
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
list_for_each_entry ( avc , & vma - > anon_vma_chain , same_vma )
vm_lock_anon_vma ( mm , avc - > anon_vma ) ;
2008-07-29 02:46:26 +04:00
}
2008-08-11 11:30:25 +04:00
2011-11-01 04:08:59 +04:00
return 0 ;
2008-07-29 02:46:26 +04:00
out_unlock :
2011-11-01 04:08:59 +04:00
mm_drop_all_locks ( mm ) ;
return - EINTR ;
2008-07-29 02:46:26 +04:00
}
static void vm_unlock_anon_vma ( struct anon_vma * anon_vma )
{
2017-09-09 02:15:08 +03:00
if ( test_bit ( 0 , ( unsigned long * ) & anon_vma - > root - > rb_root . rb_root . rb_node ) ) {
2008-07-29 02:46:26 +04:00
/*
* The LSB of head . next can ' t change to 0 from under
* us because we hold the mm_all_locks_mutex .
*
* We must however clear the bitflag before unlocking
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 03:31:39 +04:00
* the vma so the users using the anon_vma - > rb_root will
2008-07-29 02:46:26 +04:00
* never see our bitflag .
*
* No need of atomic instructions here , head . next
* can ' t change from under us until we release the
2012-12-02 23:56:46 +04:00
* anon_vma - > root - > rwsem .
2008-07-29 02:46:26 +04:00
*/
if ( ! __test_and_clear_bit ( 0 , ( unsigned long * )
2017-09-09 02:15:08 +03:00
& anon_vma - > root - > rb_root . rb_root . rb_node ) )
2008-07-29 02:46:26 +04:00
BUG ( ) ;
2013-02-23 04:34:40 +04:00
anon_vma_unlock_write ( anon_vma ) ;
2008-07-29 02:46:26 +04:00
}
}
static void vm_unlock_mapping ( struct address_space * mapping )
{
if ( test_bit ( AS_MM_ALL_LOCKS , & mapping - > flags ) ) {
/*
* AS_MM_ALL_LOCKS can ' t change to 0 from under us
* because we hold the mm_all_locks_mutex .
*/
2014-12-13 03:54:21 +03:00
i_mmap_unlock_write ( mapping ) ;
2008-07-29 02:46:26 +04:00
if ( ! test_and_clear_bit ( AS_MM_ALL_LOCKS ,
& mapping - > flags ) )
BUG ( ) ;
}
}
/*
2020-06-09 07:33:54 +03:00
* The mmap_lock cannot be released by the caller until
2008-07-29 02:46:26 +04:00
* mm_drop_all_locks ( ) returns .
*/
void mm_drop_all_locks ( struct mm_struct * mm )
{
struct vm_area_struct * vma ;
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
struct anon_vma_chain * avc ;
2022-09-06 22:49:06 +03:00
MA_STATE ( mas , & mm - > mm_mt , 0 , 0 ) ;
2008-07-29 02:46:26 +04:00
2022-04-29 09:16:11 +03:00
mmap_assert_write_locked ( mm ) ;
2008-07-29 02:46:26 +04:00
BUG_ON ( ! mutex_is_locked ( & mm_all_locks_mutex ) ) ;
2022-09-06 22:49:06 +03:00
mas_for_each ( & mas , vma , ULONG_MAX ) {
2008-07-29 02:46:26 +04:00
if ( vma - > anon_vma )
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 00:42:07 +03:00
list_for_each_entry ( avc , & vma - > anon_vma_chain , same_vma )
vm_unlock_anon_vma ( avc - > anon_vma ) ;
2008-07-29 02:46:26 +04:00
if ( vma - > vm_file & & vma - > vm_file - > f_mapping )
vm_unlock_mapping ( vma - > vm_file - > f_mapping ) ;
}
2023-02-27 20:36:20 +03:00
vma_end_write_all ( mm ) ;
2008-07-29 02:46:26 +04:00
mutex_unlock ( & mm_all_locks_mutex ) ;
}
2009-01-08 15:04:47 +03:00
/*
2017-02-25 01:56:44 +03:00
* initialise the percpu counter for VM
2009-01-08 15:04:47 +03:00
*/
void __init mmap_init ( void )
{
2009-05-01 02:08:51 +04:00
int ret ;
2014-09-08 04:51:29 +04:00
ret = percpu_counter_init ( & vm_committed_as , 0 , GFP_KERNEL ) ;
2009-05-01 02:08:51 +04:00
VM_BUG_ON ( ret ) ;
2009-01-08 15:04:47 +03:00
}
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interfce (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on the whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutley safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcomitt 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the admin to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5432/5432 no yes yes
guess yes 4 5444/5444 1 yes yes
guess no 1 5302/5449 no yes yes
guess no 4 - crash no no
never yes 1 5460/5460 1 yes yes
never yes 4 5460/5460 1 yes yes
never no 1 5218/5432 no no yes
never no 4 5203/5448 no no yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5419/5419 no - yes 8MB yes
guess yes 4 5436/5436 1 - yes 8MB yes
guess no 1 5440/5440 * - yes 8MB yes
guess no 4 - crash - no 8MB no
* process would successfully mlock, then the oom killer would pick it
never yes 1 5446/5446 no 10MB yes 20MB yes
never yes 4 5456/5456 no 10MB yes 20MB yes
never no 1 5387/5429 no 128MB no 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5359/5448 no 10MB no 10MB barely
never no 1 5323/5428 no 0MB no 10MB barely
never no 1 5332/5428 no 0MB no 50MB yes
never no 1 5293/5429 no 0MB no 90MB yes
never no 1 5001/5427 no 230MB yes 338MB yes
never no 4* 4998/5424 no 230MB yes 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 02:08:10 +04:00
/*
* Initialise sysctl_user_reserve_kbytes .
*
* This is intended to prevent a user from starting a single memory hogging
* process , such that they cannot recover ( kill the hog ) in OVERCOMMIT_NEVER
* mode .
*
* The default value is min ( 3 % of free memory , 128 MB )
* 128 MB is enough to recover with sshd / login , bash , and top / kill .
*/
2013-04-30 02:08:12 +04:00
static int init_user_reserve ( void )
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interfce (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on the whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutley safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcomitt 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the admin to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5432/5432 no yes yes
guess yes 4 5444/5444 1 yes yes
guess no 1 5302/5449 no yes yes
guess no 4 - crash no no
never yes 1 5460/5460 1 yes yes
never yes 4 5460/5460 1 yes yes
never no 1 5218/5432 no no yes
never no 4 5203/5448 no no yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5419/5419 no - yes 8MB yes
guess yes 4 5436/5436 1 - yes 8MB yes
guess no 1 5440/5440 * - yes 8MB yes
guess no 4 - crash - no 8MB no
* process would successfully mlock, then the oom killer would pick it
never yes 1 5446/5446 no 10MB yes 20MB yes
never yes 4 5456/5456 no 10MB yes 20MB yes
never no 1 5387/5429 no 128MB no 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5359/5448 no 10MB no 10MB barely
never no 1 5323/5428 no 0MB no 10MB barely
never no 1 5332/5428 no 0MB no 50MB yes
never no 1 5293/5429 no 0MB no 90MB yes
never no 1 5001/5427 no 230MB yes 338MB yes
never no 4* 4998/5424 no 230MB yes 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 02:08:10 +04:00
{
unsigned long free_kbytes ;
2017-09-07 02:23:36 +03:00
free_kbytes = global_zone_page_state ( NR_FREE_PAGES ) < < ( PAGE_SHIFT - 10 ) ;
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interfce (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on the whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutley safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcomitt 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the admin to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5432/5432 no yes yes
guess yes 4 5444/5444 1 yes yes
guess no 1 5302/5449 no yes yes
guess no 4 - crash no no
never yes 1 5460/5460 1 yes yes
never yes 4 5460/5460 1 yes yes
never no 1 5218/5432 no no yes
never no 4 5203/5448 no no yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5419/5419 no - yes 8MB yes
guess yes 4 5436/5436 1 - yes 8MB yes
guess no 1 5440/5440 * - yes 8MB yes
guess no 4 - crash - no 8MB no
* process would successfully mlock, then the oom killer would pick it
never yes 1 5446/5446 no 10MB yes 20MB yes
never yes 4 5456/5456 no 10MB yes 20MB yes
never no 1 5387/5429 no 128MB no 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5359/5448 no 10MB no 10MB barely
never no 1 5323/5428 no 0MB no 10MB barely
never no 1 5332/5428 no 0MB no 50MB yes
never no 1 5293/5429 no 0MB no 90MB yes
never no 1 5001/5427 no 230MB yes 338MB yes
never no 4* 4998/5424 no 230MB yes 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 02:08:10 +04:00
sysctl_user_reserve_kbytes = min ( free_kbytes / 32 , 1UL < < 17 ) ;
return 0 ;
}
2014-01-24 03:53:30 +04:00
subsys_initcall ( init_user_reserve ) ;
2013-04-30 02:08:11 +04:00
/*
* Initialise sysctl_admin_reserve_kbytes .
*
* The purpose of sysctl_admin_reserve_kbytes is to allow the sys admin
* to log in and kill a memory hogging process .
*
* Systems with more than 256 MB will reserve 8 MB , enough to recover
* with sshd , bash , and top in OVERCOMMIT_GUESS . Smaller systems will
* only reserve 3 % of free pages by default .
*/
2013-04-30 02:08:12 +04:00
static int init_admin_reserve ( void )
2013-04-30 02:08:11 +04:00
{
unsigned long free_kbytes ;
2017-09-07 02:23:36 +03:00
free_kbytes = global_zone_page_state ( NR_FREE_PAGES ) < < ( PAGE_SHIFT - 10 ) ;
2013-04-30 02:08:11 +04:00
sysctl_admin_reserve_kbytes = min ( free_kbytes / 32 , 1UL < < 13 ) ;
return 0 ;
}
2014-01-24 03:53:30 +04:00
subsys_initcall ( init_admin_reserve ) ;
2013-04-30 02:08:12 +04:00
/*
* Reinititalise user and admin reserves if memory is added or removed .
*
* The default user reserve max is 128 MB , and the default max for the
* admin reserve is 8 MB . These are usually , but not always , enough to
* enable recovery from a memory hogging process using login / sshd , a shell ,
* and tools like top . It may make sense to increase or even disable the
* reserve depending on the existence of swap or variations in the recovery
* tools . So , the admin may have changed them .
*
* If memory is added and the reserves have been eliminated or increased above
* the default max , then we ' ll trust the admin .
*
* If memory is removed and there isn ' t enough free memory , then we
* need to reset the reserves .
*
* Otherwise keep the reserve set by the admin .
*/
static int reserve_mem_notifier ( struct notifier_block * nb ,
unsigned long action , void * data )
{
unsigned long tmp , free_kbytes ;
switch ( action ) {
case MEM_ONLINE :
/* Default max is 128MB. Leave alone if modified by operator. */
tmp = sysctl_user_reserve_kbytes ;
if ( 0 < tmp & & tmp < ( 1UL < < 17 ) )
init_user_reserve ( ) ;
/* Default max is 8MB. Leave alone if modified by operator. */
tmp = sysctl_admin_reserve_kbytes ;
if ( 0 < tmp & & tmp < ( 1UL < < 13 ) )
init_admin_reserve ( ) ;
break ;
case MEM_OFFLINE :
2017-09-07 02:23:36 +03:00
free_kbytes = global_zone_page_state ( NR_FREE_PAGES ) < < ( PAGE_SHIFT - 10 ) ;
2013-04-30 02:08:12 +04:00
if ( sysctl_user_reserve_kbytes > free_kbytes ) {
init_user_reserve ( ) ;
pr_info ( " vm.user_reserve_kbytes reset to %lu \n " ,
sysctl_user_reserve_kbytes ) ;
}
if ( sysctl_admin_reserve_kbytes > free_kbytes ) {
init_admin_reserve ( ) ;
pr_info ( " vm.admin_reserve_kbytes reset to %lu \n " ,
sysctl_admin_reserve_kbytes ) ;
}
break ;
default :
break ;
}
return NOTIFY_OK ;
}
static int __meminit init_reserve_notifier ( void )
{
2022-09-23 06:33:47 +03:00
if ( hotplug_memory_notifier ( reserve_mem_notifier , DEFAULT_CALLBACK_PRI ) )
2014-06-07 01:38:30 +04:00
pr_err ( " Failed registering memory add/remove notifier for admin reserve \n " ) ;
2013-04-30 02:08:12 +04:00
return 0 ;
}
2014-01-24 03:53:30 +04:00
subsys_initcall ( init_reserve_notifier ) ;