Continue reducing the irq disabled scope. Check for per-cpu partial slabs first
with irqs enabled and then recheck with irqs disabled before grabbing
the slab page. Mostly preparatory for the following patches.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
As another step of shortening irq disabled sections in ___slab_alloc(), delay
disabling irqs until we pass the initial checks if there is a cached percpu
slab and it's suitable for our allocation.
Now we have to recheck c->page after actually disabling irqs as an allocation
in irq handler might have replaced it.
Because we call pfmemalloc_match() as one of the checks, we might hit
VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe()
variant that lacks the PageSlab check.
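For illustration, a minimal hedged sketch of such a variant (assuming a
PageSlab-check-free helper like __PageSlabPfmemalloc() is available alongside
this change; the exact names are an assumption):

/*
 * Sketch: like pfmemalloc_match(), but safe to call when the page may have
 * been freed under us by an interrupt, so it must not assert PageSlab(page).
 */
static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
{
	if (unlikely(__PageSlabPfmemalloc(page)))
		return gfp_pfmemalloc_allowed(gfpflags);

	return true;
}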
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This
includes cases where this is not needed, such as when the allocation ends up in
the page allocator and has to awkwardly enable irqs back based on gfp flags.
Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when
it hits the __slab_alloc() slow path, and long periods with disabled interrupts
are undesirable.
As a first step towards reducing irq disabled periods, move irq handling into
___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer
from becoming invalid via get_cpu_ptr(), i.e. preempt_disable(). This does not
protect against modification by an irq handler, which is still prevented by
disabled irqs for most of ___slab_alloc(). As a small immediate benefit,
slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled.
kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them
before calling ___slab_alloc(), which then disables them at its discretion. The
whole kmem_cache_alloc_bulk() operation also disables preemption.
When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
preemption, because new_slab() will re-enable interrupts in contexts that allow
blocking (this will be improved by later patches).
The patch itself will thus increase overhead a bit due to disabled preemption
(on configs where it matters) and increased disabling/enabling irqs in
kmem_cache_alloc_bulk(), but that will be gradually improved in the following
patches.
Note that in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard to
CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly paired in
all configurations. On configs without involuntary preemption and debugging,
the re-read of the kmem_cache_cpu pointer is still compiled out as it was before.
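For illustration, a hedged sketch of the resulting calling convention
(simplified; the helper names follow existing percpu/SLUB APIs, but the exact
shape is an assumption):

/*
 * Sketch: __slab_alloc() no longer disables irqs itself; it only pins the
 * s->cpu_slab percpu pointer so that ___slab_alloc() can manage irqs.
 */
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
			  unsigned long addr, struct kmem_cache_cpu *c)
{
	void *p;

#ifdef CONFIG_PREEMPT_COUNT
	/*
	 * We may have been preempted and rescheduled on a different cpu
	 * before disabling preemption; re-read the percpu area pointer.
	 */
	c = get_cpu_ptr(s->cpu_slab);
#endif
	p = ___slab_alloc(s, gfpflags, node, addr, c);
#ifdef CONFIG_PREEMPT_COUNT
	put_cpu_ptr(s->cpu_slab);
#endif
	return p;
}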
[ Mike Galbraith <efault@gmx.de>: Fix kmem_cache_alloc_bulk() error path ]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that
our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently
that's done by reading the tid first using this_cpu_read(), then the
kmem_cache_cpu pointer and verifying we read the same tid using the pointer and
plain READ_ONCE().
This can be simplified to just fetching kmem_cache_cpu pointer and then reading
tid using the pointer. That guarantees they are from the same cpu. We don't
need to read the tid using this_cpu_read() because the value will be validated
by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and that the
freelist wasn't changed by anyone preempting us since the tid was read.
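For illustration, a hedged before/after sketch of that fastpath fetch (field
and helper names follow the existing SLUB code; details are an assumption):

	/* Before: read tid first, then the percpu pointer, and re-check. */
	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		c = raw_cpu_ptr(s->cpu_slab);
	} while (IS_ENABLED(CONFIG_PREEMPTION) &&
		 unlikely(tid != READ_ONCE(c->tid)));

	/*
	 * After: fetch the percpu pointer once, then read tid through it;
	 * any cpu migration or preemption is caught later by
	 * this_cpu_cmpxchg_double().
	 */
	c = raw_cpu_ptr(s->cpu_slab);
	tid = READ_ONCE(c->tid);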
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
When we allocate slab object from a newly acquired page (from node's partial
list or page allocator), we usually also retain the page as a new percpu slab.
There are two exceptions - when pfmemalloc status of the page doesn't match our
gfp flags, or when the cache has debugging enabled.
The current code for these decisions is not easy to follow, so restructure it
and add comments. The new structure will also help with the following changes.
No functional change.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
The function get_partial() finds a suitable page on a partial list, acquires
and returns its freelist and assigns the page pointer to kmem_cache_cpu.
In a later patch we will need more control over the kmem_cache_cpu.page
assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a
pointer to a page that get_partial() can fill, and let the caller assign the
kmem_cache_cpu.page pointer. No functional change as all of this still happens
with disabled IRQs.
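For illustration, a hedged sketch of the changed signature (the parameter name
is an assumption):

/*
 * Sketch: get_partial() fills a page pointer owned by the caller instead of
 * writing kmem_cache_cpu.page itself; the caller assigns c->page afterwards.
 */
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
			 struct page **ret_page);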
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The later patches will need more fine grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it
there. This is a preparatory step with no functional change.
The only minor change is moving WARN_ON_ONCE() for using a constructor together
with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still
able to catch a development change introducing a systematic misuse.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
The later patches will need more fine grained control over individual actions
in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first
preparatory step with no functional change.
This adds a goto label that appears unnecessary at this point, but will be
useful for later changes.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Merge tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.
- Show a warning if a different compiler is used for building external
modules.
- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.
- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.
- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.
- Add Nick Desaulniers as a Kbuild reviewer.
- Drop stale cc-option tests.
- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.
- Show a warning if 'FORCE' is missing for if_changed rules.
- Various cleanups
* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kbuild: redo fake deps at include/ksym/*.h
kbuild: clean up objtool_args slightly
modpost: get the *.mod file path more simply
checkkconfigsymbols.py: Fix the '--ignore' option
kbuild: merge vmlinux_link() between ARCH=um and other architectures
kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
kbuild: remove stale *.symversions
kbuild: remove unused quiet_cmd_update_lto_symversions
gen_compile_commands: extract compiler command from a series of commands
x86: remove cc-option-yn test for -mtune=
arc: replace cc-option-yn uses with cc-option
s390: replace cc-option-yn uses with cc-option
ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
sparc: move the install rule to arch/sparc/Makefile
security: remove unneeded subdir-$(CONFIG_...)
kbuild: sh: remove unused install script
kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
kbuild: Switch to 'f' variants of integrated assembler flag
kbuild: Shuffle blank line to improve comment meaning
...
Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately")
introduced cpu partial flushing for kmemcg caches, based on setting the target
cpu_partial to 0 and adding a flushing check in put_cpu_partial().
This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab:
introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03
("mm: memcg/slab: use a single set of kmem_caches for all accounted
allocations"). However the check and flush in put_cpu_partial() was never
removed, although it's effectively a dead code. So this patch removes it.
Note that d6e0b7fa1186 also added preempt_disable()/enable() to
unfreeze_partials() which could be thus also considered unnecessary. But
further patches will rely on it, so keep it.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In slab_free_hook() we disable irqs around the debug_check_no_locks_freed()
call, which is unnecessary, as irqs are already being disabled inside the call.
This seems to be a leftover from the past when there were more calls inside the
irq-disabled section. Remove the irq disable/enable operations.
Mel noted:
> Looks like it was needed for kmemcheck which went away back in 4.15
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context. In both situations it's straightforward to preallocate a
private object bitmap instead of grabbing the shared static one meant for
critical sections, so let's do that.
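For illustration, a hedged sketch of the preallocation pattern (assuming SLUB
internals such as oo_objects() and a validate_slab_node() that accepts the map;
details are an assumption):

/* Sketch: allocate a private object map up front, outside any locked section. */
static long validate_slab_cache(struct kmem_cache *s)
{
	int node;
	long count = 0;
	struct kmem_cache_node *n;
	unsigned long *obj_map;

	obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
	if (!obj_map)
		return -ENOMEM;

	flush_all(s);
	for_each_kmem_cache_node(s, node, n)
		count += validate_slab_node(s, n, obj_map);

	bitmap_free(obj_map);
	return count;
}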
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
SLUB has a static, spinlock-protected bitmap for marking which objects are on a
freelist when it wants to list them. It is used in situations where dynamically
allocating such a map could lead to recursion or locking issues and an on-stack
bitmap would be too large.
The handlers of debugfs files alloc_traces and free_traces also currently use this
shared bitmap, but their syscall context makes it straightforward to allocate a
private map before entering locked sections, so switch these processing paths
to use a private bitmap.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
slab_debug_trace_open() can only be called on caches with SLAB_STORE_USER flag
and as with all slub debugging flags, such caches avoid cpu or percpu partial
slabs altogether, so there's nothing to flush.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
There is a use case in Android where an app process's memory is swapped out
by process_madvise() with MADV_PAGEOUT, for example to zram or a backing
device. When the process is scheduled to run again, such as when it switches
to the foreground, multiple page faults may cause the app to drop frames.
To reduce the problem, system management software can read ahead the memory
of the process immediately when the app switches to the foreground. Calling
process_madvise() with MADV_WILLNEED can meet this need.
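For illustration, a hedged userspace sketch of that call (the helper is
hypothetical and assumes __NR_process_madvise is provided by the installed
headers):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sketch: ask the kernel to read ahead a region of another process's memory. */
static long prefault_region(int pidfd, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };

	/* process_madvise(pidfd, iovec, vlen, advice, flags) */
	return syscall(__NR_process_madvise, pidfd, &iov, 1, MADV_WILLNEED, 0);
}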
Link: https://lkml.kernel.org/r/20210804082010.12482-1-zhangkui@oppo.com
Signed-off-by: zhangkui <zhangkui@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return values of pagetypeinfo_showfree and pagetypeinfo_showblockcount
are unused now. Remove them.
Link: https://lkml.kernel.org/r/20210715122911.15700-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can replace the array_num * sizeof(array[0]) with sizeof(array) to
simplify the code.
Link: https://lkml.kernel.org/r/20210715122911.15700-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Cleanup for vmstat".
This series contains cleanups to remove unneeded return value, correct
wrong comment and simplify the array size calculation. More details can
be found in the respective changelogs.
This patch (of 3):
Correct the wrong fls(mem+1) to fls(mem)+1 and remove the comment duplicated
from quiet_vmstat().
Link: https://lkml.kernel.org/r/20210715122911.15700-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210715122911.15700-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b239f7daf553 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE")
removed the parameter 'for_alloc', so remove this comment.
Link: https://lkml.kernel.org/r/1630576043-21367-1-git-send-email-jingxiangfeng@huawei.com
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ksm_stable_node_chains_prune_millisecs is declared as int, but in
stable_node_chains_prune_millisecs_store(), it can store values up to
UINT_MAX. Change its type to unsigned int.
Link: https://lkml.kernel.org/r/20210806111351.GA71845@asus
Signed-off-by: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the expected "Return:" format to prevent a kernel-doc warning.
mm/migrate.c:1157: warning: Excess function parameter 'returns' description in 'next_demotion_node'
Link: https://lkml.kernel.org/r/20210808203151.10632-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.
For such a system component it's important to be able to free memory quickly
and efficiently. Unfortunately, the time a process takes to free up its
memory after receiving a SIGKILL might vary based on the state of the
process (uninterruptible sleep), its size, and the OPP level of the core the
process is running on. A mechanism to free the resources of the target process
in a more predictable way would improve the system's ability to control its
memory pressure.
Introduce process_mrelease system call that releases memory of a dying
process from the context of the caller. This way the memory is freed in a
more controllable way with CPU affinity and priority of the caller. The
workload of freeing the memory will also be charged to the caller. The
operation is allowed only on a dying process.
After previous discussions [1, 2, 3] the decision was made [4] to
introduce a dedicated system call to cover this use case.
The API is as follows,
int process_mrelease(int pidfd, unsigned int flags);
DESCRIPTION
The process_mrelease() system call is used to free the memory of
an exiting process.
The pidfd selects the process referred to by the PID file
descriptor.
(See pidfd_open(2) for further information)
The flags argument is reserved for future use; currently, this
argument must be specified as 0.
RETURN VALUE
On success, process_mrelease() returns 0. On error, -1 is
returned and errno is set to indicate the error.
ERRORS
EBADF pidfd is not a valid PID file descriptor.
EAGAIN Failed to release part of the address space.
EINTR The call was interrupted by a signal; see signal(7).
EINVAL flags is not 0.
EINVAL The memory of the task cannot be released because the
process is not exiting, the address space is shared
with another live process or there is a core dump in
progress.
ENOSYS This system call is not supported, for example, without
MMU support built into Linux.
ESRCH The target process does not exist (i.e., it has terminated
and been waited on).
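For illustration, a hedged userspace sketch of the intended kill-then-reap flow
(the helper is hypothetical and assumes SYS_process_mrelease is provided by the
installed headers; glibc may not offer a wrapper):

#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: kill a process via its pidfd and reap its memory from this context. */
static int kill_and_release(int pidfd)
{
	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0))
		return -1;

	/* flags must currently be 0 */
	return syscall(SYS_process_mrelease, pidfd, 0);
}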
[1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/
[2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/
[3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/
[4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/
Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times when the memblock allocation APIs did not
exist. memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.
Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.
This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().
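For illustration, a hedged sketch of the typical conversion this implies for
callers (the surrounding variables are illustrative):

/* Before: find-then-reserve, where the memblock_reserve() error is easy to miss. */
base = memblock_find_in_range(min_addr, max_addr, size, align);
if (!base)
	return -ENOMEM;
memblock_reserve(base, size);

/* After: a single call that both reserves the range and reports failure. */
base = memblock_phys_alloc_range(size, align, min_addr, max_addr);
if (!base)
	return -ENOMEM;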
Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI]
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv]
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As they all do the same thing (sanity check and save nodemask info), create
one mpol_new_nodemask() to reduce the redundancy.
Link: https://lkml.kernel.org/r/1627970362-61305-6-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.
MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch. Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode. Those shall contain the canonical reference.
NUMA systems continue to become more prevalent. New technologies like
PMEM make finer grain control over memory access patterns increasingly
desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of nodes
that will be tried first when performing allocations. If those
allocations fail, all remaining nodes will be tried. It's a straightforward
API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines. The mode
will work either per VMA or per thread.
[Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY]
Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-5-git-send-email-feng.tang@intel.com
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The semantics of MPOL_PREFERRED_MANY are similar to MPOL_PREFERRED, in that it
will first try to allocate memory from the preferred node(s) and fall back
to all nodes in the system when the first try fails.
Add a dedicated function alloc_pages_preferred_many() for it, just like for
the 'interleave' policy, which will be used by 2 general memory allocation
APIs: alloc_pages() and alloc_pages_vma()
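For illustration, a hedged sketch of the two-pass behaviour described above
(the exact gfp handling and field names are assumptions):

/* Sketch: try the preferred nodes first without reclaim pressure, then fall back. */
static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
					       int nid, struct mempolicy *pol)
{
	gfp_t preferred_gfp;
	struct page *page;

	/* First pass: restrict to the preferred nodemask, don't warn or reclaim. */
	preferred_gfp = gfp | __GFP_NOWARN;
	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
	page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
	if (!page)
		/* Second pass: all allowed nodes. */
		page = __alloc_pages(gfp, order, nid, NULL);

	return page;
}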
Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-3-git-send-email-feng.tang@intel.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Originally-by: Ben Widawsky <ben.widawsky@intel.com>
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Introduce multi-preference mempolicy", v7.
This patch series introduces the concept of the MPOL_PREFERRED_MANY
mempolicy. This mempolicy mode can be used with either the
set_mempolicy(2) or mbind(2) interfaces. Like the MPOL_PREFERRED
interface, it allows an application to set a preference for nodes which
will fulfil memory allocation requests. Unlike the MPOL_PREFERRED mode,
it takes a set of nodes. Like the MPOL_BIND interface, it works over a
set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the
OOM killer if those preferred nodes are not available.
Along with these patches are patches for libnuma, numactl, numademo, and
memhog. They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new
usage: `numactl -P 0,3,4`
The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.
1a. The Hare - The interconnect is fast enough to meet bandwidth and
latency requirements allowing preference to be given to all nodes with
"fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast
memory (or perhaps slow memory), but doesn't care which node it runs
on. The application can prefer a set of nodes and then xpu bind to
the local node (cpu, accelerator, etc). This reverses how nodes are
chosen today, where the kernel attempts to use memory local to the CPU
whenever possible. This will instead attempt to use the accelerator local to
2. The Tortoise - The administrator (or the application itself) is
aware it only needs slow memory, and so can prefer that.
Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fallback to another set of nodes if
binding fails to all nodes in the nodemask.
Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
preference.
> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added
complexity is not needed for the expected use cases.
2. A flag for bind to allow falling back to other nodes. This
confuses the notion of binding and is less flexible than the current
solution.
3. Create flags or new modes that helps with some ordering. This
offers both a friendlier API as well as a solution for more customized
usage. It's unknown if it's worth the complexity to support this.
Here is sample code for how this might work:
> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
This patch (of 5):
The NUMA APIs currently allow passing in a "preferred node" as a single
bit set in a nodemask. If more than one bit is set, bits after the first
are ignored.
This single node is generally OK for location-based NUMA where memory
being allocated will eventually be operated on by a single CPU. However,
in systems with multiple memory types, folks want to target a *type* of
memory instead of a location. For instance, someone might want some
high-bandwidth memory but not care about the CPU next to which it is
allocated. Or, they might want a cheap, high-capacity allocation and want to
target all NUMA nodes which have persistent memory in volatile mode. In
both of these cases, the application wants to target a *set* of nodes, but
does not want strict MPOL_BIND behavior as that could lead to OOM killer
or SIGSEGV.
So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes
requirement. This is not a pie-in-the-sky dream for an API. This was a
response to a specific ask of more than one group at Intel. Specifically:
1. There are existing libraries that target memory types such as
https://github.com/memkind/memkind. These are known to suffer from
SIGSEGV's when memory is low on targeted memory "kinds" that span more
than one node. The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an
example of this.
2. Volatile-use persistent memory users want to have a memory policy
which is targeted at either "cheap and slow" (PMEM) or "expensive and
fast" (DRAM). However, they do not want to experience allocation
failures when the targeted type is unavailable.
3. Allocate-then-run. Generally, we let the process scheduler decide
on which physical CPU to run a task. That location provides a default
allocation policy, and memory availability is not generally considered
when placing tasks. For situations where memory is valuable and
constrained, some users want to allocate memory first, *then* allocate
close compute resources to the allocation. This is the reverse of the
normal (CPU) model. Accelerators such as GPUs that operate on
core-mm-managed memory are interested in this model.
A check is added in sanitize_mpol_flags() to not permit the 'prefer_many'
policy to be used for now; it will be removed in a later patch after all
implementations for 'prefer_many' are ready, as suggested by Michal Hocko.
[mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling]
Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com
Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callers of mpol_misplaced() already use NUMA_NO_NODE to check whether
the current page node is misplaced, so using NUMA_NO_NODE in
mpol_misplaced() instead of a magic number is more readable.
Link: https://lkml.kernel.org/r/1b77c0ce21183fa86f4db250b115cf5e27396528.1627558356.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The proactive compaction[1] gets triggered every 500 msec and runs
compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages
based on the value set in sysctl.compaction_proactiveness. Triggering
compaction every 500 msec in search of COMPACTION_HPAGE_ORDER pages is
not needed for all applications, especially on embedded system
use cases which may have only a few MBs of RAM. Enabling proactive
compaction in its default state will end up running almost always on such
systems.
On the other hand, proactive compaction can still be very useful for getting
a set of higher-order pages in a controllable manner (controlled by
sysctl.compaction_proactiveness). So, on systems where always enabling
proactive compaction may prove unnecessary, user space can trigger it on
demand by writing to its sysctl interface. As an example, say an app
launcher decides to launch a memory-heavy application which can be
launched faster if it gets more higher-order pages; the launcher can prepare
the system in advance by triggering proactive compaction from
userspace.
This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by the user.
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a
[akpm@linux-foundation.org: tweak vm.rst, per Mike]
Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka figured out that when the fragmentation score doesn't go down
across a proactive compaction run, i.e. when no progress is made, the next wake
up for proactive compaction is deferred for 1 << COMPACT_MAX_DEFER_SHIFT,
i.e. 64 wakeups, each with an interval of
HPAGE_FRAG_CHECK_INTERVAL_MSEC (=500). On each of these wakeups, kcompactd just
decrements the 'proactive_defer' counter and goes back to sleep, i.e. it is
being woken merely to decrement a counter.
The same deferral time can also be achieved by simply sleeping for
HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT; the unnecessary
wakeups of the kcompactd thread are thus avoided and the need for the
'proactive_defer' counter is removed.
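For illustration, a hedged sketch of the resulting kcompactd timeout handling
(the variable names are illustrative): one sleep of 500 << 6 = 32000 msec
replaces 64 separate 500 msec wakeups.

	unsigned long timeout = msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC);

	if (should_proactive_compact_node(pgdat)) {
		unsigned int prev_score = fragmentation_score_node(pgdat);

		proactive_compact_node(pgdat);
		/* No progress: defer by sleeping longer instead of counting wakeups. */
		if (fragmentation_score_node(pgdat) >= prev_score)
			timeout = msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC
						   << COMPACT_MAX_DEFER_SHIFT);
	}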
[akpm@linux-foundation.org: tweak comment]
Link: https://lore.kernel.org/linux-fsdevel/88abfdb6-2c13-b5a6-5b46-742d12d1c910@suse.cz/
Link: https://lkml.kernel.org/r/1626869599-25412-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drop_slab_node() is called as part of echo 2>/proc/sys/vm/drop_caches
operation. It iterates over all memcgs and calls shrink_slab() which in
turn iterates over all slab shrinkers. Freed objects are counted and as
long as the total number of freed objects from all memcgs and shrinkers is
higher than 10, drop_slab_node() loops for another full memcgs*shrinkers
iteration.
This arbitrary constant threshold of 10 can result in effectively an
infinite loop on a system with large number of memcgs and/or parallel
activity that allocates new objects. This has been reported previously by
Chunxin Zang [1] and recently by our customer.
The previous report [1] has resulted in commit 069c411de40a ("mm/vmscan:
fix infinite loop in drop_slab_node") which added a check for signals
allowing the user to terminate the command writing to drop_caches. At the
time it was also considered to make the threshold grow with each iteration
to guarantee termination, but such patch hasn't been formally proposed
yet.
This patch implements the dynamically growing threshold. At first
iteration it's enough to free one object to continue, and this threshold
effectively doubles with each iteration. Our customer's feedback was
positive.
There is always a risk that on some system this change will cause a
previously terminating drop_caches operation to terminate sooner and free
fewer objects. Ideally the semantics would guarantee freeing all freeable
objects that existed at the moment of starting the operation, while not
looping forever for newly allocated objects, but that's not feasible to
track. In the less ideal solution based on thresholds, arguably the
termination guarantee is more important than the exhaustiveness guarantee.
If there are reports of large regression wrt being exhaustive, we can
tune how fast the threshold grows.
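For illustration, a hedged sketch of the growing threshold (the loop structure
follows the existing drop_slab_node(); the exact termination expression is an
assumption):

void drop_slab_node(int nid)
{
	unsigned long freed;
	int shift = 0;

	do {
		struct mem_cgroup *memcg = NULL;

		if (fatal_signal_pending(current))
			return;

		freed = 0;
		memcg = mem_cgroup_iter(NULL, NULL, NULL);
		do {
			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
		/*
		 * Continue only while the freed count beats a threshold that
		 * doubles each pass: 1 object, then 2, 4, 8, ...  A complete
		 * version must also cap 'shift' so it never reaches the width
		 * of unsigned long (the undefined-shift fix noted below).
		 */
	} while ((freed >> shift++) != 0);
}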
[1] https://lore.kernel.org/lkml/20200909152047.27905-1-zangchunxin@bytedance.com/T/#u
[vbabka@suse.cz: avoid undefined shift behaviour]
Link: https://lkml.kernel.org/r/2f034e6f-a753-550a-f374-e4e23899d3d5@suse.cz
Link: https://lkml.kernel.org/r/20210818152239.25502-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Chunxin Zang <zangchunxin@bytedance.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We could add 'else' to remove the somewhat odd check_pending label and make
the code more succinct.
Link: https://lkml.kernel.org/r/20210717065911.61497-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return value of kswapd_run() is unused now. Clean it up.
Link: https://lkml.kernel.org/r/20210717065911.61497-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The priority field of sc is used to control how many pages we should scan
at once, but in these functions we always traverse the whole list to shrink
the pages. So these settings are unneeded and misleading.
Link: https://lkml.kernel.org/r/20210717065911.61497-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Cleanups for vmscan", v2.
This series contains cleanups to remove an unneeded return value, a misleading
setting and so on. It also removes the PageDirty check after MADV_FREE
pages have been page_ref_freeze'd. More details can be found in the respective
changelogs.
This patch (of 4):
If the MADV_FREE pages are redirtied before they could be reclaimed, put
the pages back to anonymous LRU list by setting SwapBacked flag and the
pages will be reclaimed in normal swapout way. But as Yu Zhao pointed
out, "The page has only one reference left, which is from the isolation.
After the caller puts the page back on lru and drops the reference, the
page will be freed anyway. It doesn't matter which lru it goes." So we
don't bother checking PageDirty here.
[Yu Zhao's comment is also quoted in the code.]
Link: https://lkml.kernel.org/r/20210717065911.61497-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210717065911.61497-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can get the memcg directly from vmpr instead of going through
vmpr->memcg->css->memcg, so add a new helper vmpressure_to_memcg(). And since
no code will use vmpressure_to_css() any more, delete it.
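For illustration, a hedged sketch of the new helper (assuming struct mem_cgroup
embeds the vmpressure structure, as the existing vmpressure code implies):

/* Sketch: recover the owning memcg directly from the embedded vmpressure. */
static struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr)
{
	return container_of(vmpr, struct mem_cgroup, vmpressure);
}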
Link: https://lkml.kernel.org/r/20210630112146.455103-1-suhui@zeku.com
Signed-off-by: Hui Su <suhui@zeku.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some method is obviously needed to enable reclaim-based migration.
Just like traditional autonuma, there will be some workloads that will
benefit like workloads with more "static" configurations where hot pages
stay hot and cold pages stay cold. If pages come and go from the hot and
cold sets, the benefits of this approach will be more limited.
The benefits are truly workload-based and *not* hardware-based. We do not
believe that there is a viable threshold where certain hardware
configurations should have this mechanism enabled while others do not.
To be conservative, earlier work defaulted to disabling reclaim-based
migration and did not include a mechanism to enable it. This patch proposes
adding a new sysfs file
/sys/kernel/mm/numa/demotion_enabled
as a method to enable it.
We are open to any alternative that allows end users to enable this
mechanism or disable it if workload harm is detected (just like
traditional autonuma).
Once this is enabled page demotion may move data to a NUMA node that does
not fall into the cpuset of the allocating process. This could be
construed to violate the guarantees of cpusets. However, since this is an
opt-in mechanism, the assumption is that anyone enabling it is content to
relax the guarantees.
Link: https://lkml.kernel.org/r/20210721063926.3024591-9-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-10-ying.huang@intel.com
Signed-off-by: Huang Ying <ying.huang@intel.com>
Originally-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Global reclaim aims to reduce the amount of memory used on a given node or
set of nodes. Migrating pages to another node serves this purpose.
memcg reclaim is different. Its goal is to reduce the total memory
consumption of the entire memcg, across all nodes. Migration does not
assist memcg reclaim because it just moves page contents between nodes
rather than actually reducing memory consumption.
Link: https://lkml.kernel.org/r/20210715055145.195411-9-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reclaim anonymous pages if a migration path is available now that demotion
provides a non-swap recourse for reclaiming anon pages.
Note that this check is subtly different from the can_age_anon_pages()
checks. This mechanism checks whether a specific page in a specific
context can actually be reclaimed, given current swap space and cgroup
limits.
can_age_anon_pages() is a much simpler and more preliminary check which
just says whether there is a possibility of future reclaim.
[kbusch@kernel.org: v11]
Link: https://lkml.kernel.org/r/20210715055145.195411-8-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210721063926.3024591-7-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-8-ying.huang@intel.com
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anonymous pages are kept on their own LRU(s). These lists could
theoretically always be scanned and maintained. But, without swap, there
is currently nothing the kernel can *do* with the results of a scanned,
sorted LRU for anonymous pages.
A check for '!total_swap_pages' currently serves as a valid check as to
whether anonymous LRUs should be maintained. However, another method will
be added shortly: page demotion.
Abstract out the 'total_swap_pages' checks into a helper, give it a
logically significant name, and check for the possibility of page
demotion.
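For illustration, a hedged sketch of such a helper (the demotion-check name is
an assumption based on this series):

/*
 * Sketch: anon LRUs are worth aging if the pages can either be swapped out
 * or demoted to a lower memory tier.
 */
static inline bool can_age_anon_pages(struct pglist_data *pgdat,
				      struct scan_control *sc)
{
	/* Aging anon pages is useful if we can swap them out... */
	if (total_swap_pages > 0)
		return true;

	/* ...or demote them to a node lower in the memory hierarchy. */
	return can_demote(pgdat->node_id, sc);
}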
[dave.hansen@linux.intel.com: v11]
Link: https://lkml.kernel.org/r/20210715055145.195411-7-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210721063926.3024591-6-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-7-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Account the number of demoted pages.
Add pgdemote_kswapd and pgdemote_direct VM counters showed in
/proc/vmstat.
[ daveh:
- __count_vm_events() a bit, and made them look at the THP
size directly rather than getting data from migrate_pages()
]
Link: https://lkml.kernel.org/r/20210721063926.3024591-5-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-6-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Wei Xu <weixugc@google.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is mostly derived from a patch from Yang Shi:
https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/
Add code to the reclaim path (shrink_page_list()) to "demote" data to
another NUMA node instead of discarding the data. This always avoids the
cost of I/O needed to read the page back in and sometimes avoids the
writeout cost when the page is dirty.
A second pass through shrink_page_list() will be made if any demotions
fail. This essentially falls back to normal reclaim behavior in the case
that demotions fail. Previous versions of this patch may have simply
failed to reclaim pages which were eligible for demotion but were unable
to be demoted in practice.
In some cases, for example MADV_PAGEOUT, the pages are always discarded
instead of demoted, to follow the kernel API definition, because
MADV_PAGEOUT is defined as freeing the specified pages regardless of which
tier they are in.
Note: This just adds the start of infrastructure for migration. It is
actually disabled next to the FIXME in migrate_demote_page_ok().
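For illustration, a rough, heavily simplified sketch of the two-pass structure
this describes (demote_page_list() and the retry plumbing are assumptions based
on the text above):

	/* Sketch: pages selected for demotion are diverted to a private list. */
	LIST_HEAD(demote_pages);
	bool do_demote_pass = true;

retry:
	/* ... main reclaim loop: demotable pages are moved to demote_pages
	 *     instead of being written out or discarded ... */

	/* Migrate the collected pages to the next memory tier. */
	nr_reclaimed += demote_page_list(&demote_pages, pgdat);

	/* Demotion failures fall back to normal reclaim on a second pass. */
	if (!list_empty(&demote_pages) && do_demote_pass) {
		list_splice_init(&demote_pages, page_list);
		do_demote_pass = false;
		goto retry;
	}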
[dave.hansen@linux.intel.com: v11]
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210721063926.3024591-4-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Wei Xu <weixugc@google.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.
Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.
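The "[optional parameter]" note below hints at the mechanism; for illustration,
a hedged sketch of the resulting signature:

/*
 * Sketch: an optional output parameter reports how many pages migrated
 * successfully even when the return value is an error code.
 */
int migrate_pages(struct list_head *from, new_page_t get_new_page,
		  free_page_t put_new_page, unsigned long private,
		  enum migrate_mode mode, int reason,
		  unsigned int *ret_succeeded);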
Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reclaim-based migration is attempting to optimize data placement in memory
based on the system topology. If the system changes, so must the
migration ordering.
The implementation is conceptually simple and entirely unoptimized. On
any memory or CPU hotplug events, assume that a node was added or removed
and recalculate all migration targets. This ensures that the
node_demotion[] array is always ready to be used in case the new reclaim
mode is enabled.
This recalculation is far from optimal, most glaringly in that it does not
even attempt to figure out whether the hotplug event would have any *actual*
effect on the demotion order. But, given the expected paucity of hotplug
events, this should be fine.
Link: https://lkml.kernel.org/r/20210721063926.3024591-2-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-3-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Migrate Pages in lieu of discard", v11.
We're starting to see systems with more and more kinds of memory such as
Intel's implementation of persistent memory.
Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out. Allocations will, at some point, start
falling over to the slower persistent memory.
That has two nasty properties. First, the newer allocations can end up in
the slower persistent memory. Second, reclaimed data in DRAM are just
discarded even if there are gobs of space in persistent memory that could
be used.
This patchset implements a solution to these problems. At the end of the
reclaim process in shrink_page_list() just before the last page refcount
is dropped, the page is migrated to persistent memory instead of being
dropped.
While I've talked about a DRAM/PMEM pairing, this approach would function
in any environment where memory tiers exist.
This is not perfect. It "strands" pages in slower memory and never brings
them back to fast DRAM. Huang Ying has follow-on work which repurposes
NUMA balancing to promote hot pages back to DRAM.
This is also all based on an upstream mechanism that allows persistent
memory to be onlined and used as if it were volatile:
http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
With that, the DRAM and PMEM in each socket will be represented as 2
separate NUMA nodes, with the CPUs sitting in the DRAM node. So the
general inter-NUMA demotion mechanism introduced in the patchset can
migrate the cold DRAM pages to the PMEM node.
We have tested the patchset with postgresql and pgbench. On a
2-socket server machine with DRAM and PMEM, the kernel with the patchset
can improve the score of pgbench up to 22.1% compared with that of the
DRAM only + disk case. This comes from the reduced disk read throughput
(which is reduced by up to 70.8%).
== Open Issues ==
* Pages under memory policies and cpusets that, for instance, restrict
allocations to DRAM can still be demoted to PMEM whenever they opt in to this
new mechanism. A cgroup-level API to opt in to or out of
these migrations will likely be required as a follow-on.
* Could be more aggressive about where anon LRU scanning occurs
since it no longer necessarily involves I/O. get_scan_count()
for instance says: "If we have no swap space, do not bother
scanning anon pages"
This patch (of 9):
Prepare for the kernel to auto-migrate pages to other memory nodes with a
node migration table. This allows creating a single migration target for
each NUMA node to enable the kernel to do NUMA page migrations instead of
simply discarding colder pages. A node with no target is a "terminal
node", so reclaim acts normally there. The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.
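As a rough sketch of the data structure being described (the initializer
style is assumed), the table holds one demotion target per node and
NUMA_NO_NODE marks a terminal node:

        /* One demotion target per node; NUMA_NO_NODE means "terminal node". */
        static int node_demotion[MAX_NUMNODES] __read_mostly =
                {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};

        /* Which node should a cold page on @node be demoted to, if any? */
        int next_demotion_node(int node)
        {
                return node_demotion[node];
        }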
When memory fills up on a node, memory contents can be automatically
migrated to another node. The biggest problems are knowing when to
migrate and to where the migration should be targeted.
The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists. Those lists already tell us,
if memory is full, where to look next. It would also be logical to move
memory in that order.
But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists. This would potentially lead to migration cycles (A->B,
B->A, A->B, ...).
Instead of using the allocator fallback lists directly, keep a separate
node migration ordering. But, reuse the same data used to generate page
allocator fallback in the first place: find_next_best_node().
This means that the firmware data used to populate node distances
essentially dictates the ordering for now. It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().
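A hedged sketch of how the ordering can be derived from
find_next_best_node(); the loop is simplified to a single demotion hop and
the function name is illustrative. The key property is that nodes already
handed out accumulate in a nodemask, so no node is selected twice and no
A->B->A cycle can be built:

        static void build_demotion_order(void)          /* illustrative name */
        {
                nodemask_t used_targets = NODE_MASK_NONE;
                int node, target;

                for_each_node_state(node, N_CPU) {
                        /* A node never demotes to itself. */
                        node_set(node, used_targets);

                        /* Same data the allocator fallback lists are built from. */
                        target = find_next_best_node(node, &used_targets);
                        if (target == NUMA_NO_NODE)
                                continue;       /* terminal node */

                        node_demotion[node] = target;
                }
        }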
RCU is used to allow lock-less reads of node_demotion[] and to prevent
demotion cycles from being observed. If multiple reads of node_demotion[] are
performed, a single rcu_read_lock() must be held over all reads to ensure
no cycles are observed. Details are as follows.
=== What does RCU provide? ===
Imagine a simple loop which walks down the demotion path looking
for the last node:
        terminal_node = start_node;
        while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                terminal_node = node_demotion[terminal_node];
        }
The initial values are:
        node_demotion[0] = 1;
        node_demotion[1] = NUMA_NO_NODE;
and are updated to:
        node_demotion[0] = NUMA_NO_NODE;
        node_demotion[1] = 0;
What guarantees that the cycle:
        node_demotion[0] = 1;
        node_demotion[1] = 0;
is never observed, which would make the loop above run forever?
With RCU, a rcu_read_lock/unlock() can be placed around the loop. Since
the write side does a synchronize_rcu(), the loop that observed the old
contents is known to be complete before the synchronize_rcu() has
completed.
RCU, combined with disable_all_migrate_targets(), ensures that the old
migration state is not visible by the time __set_migration_target_nodes()
is called.
=== What does READ_ONCE() provide? ===
READ_ONCE() forbids the compiler from merging or reordering successive
reads of node_demotion[]. This ensures that any updates are *eventually*
observed.
Consider the above loop again. The compiler could theoretically read the
entirety of node_demotion[] into local storage (registers) and never go
back to memory, and *permanently* observe bad values for node_demotion[].
Note: RCU does not provide any universal compiler-ordering
guarantees:
https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
This code is unused for now. It will be called later in the
series.
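Putting the two pieces together, a sketch of the reader/writer pairing
described above; disable_all_migrate_targets() and
__set_migration_target_nodes() are the helpers named in this text, while the
reader function name is illustrative:

        /* Reader: hold rcu_read_lock() across *all* reads of the path and use
         * READ_ONCE() so the compiler cannot cache node_demotion[] forever. */
        static int last_demotion_node(int start_node)
        {
                int next, terminal_node = start_node;

                rcu_read_lock();
                while ((next = READ_ONCE(node_demotion[terminal_node])) != NUMA_NO_NODE)
                        terminal_node = next;
                rcu_read_unlock();

                return terminal_node;
        }

        /* Writer: first make every node terminal, wait out all readers that
         * might still see the old table, then install the new ordering.  A
         * reader thus sees the old table, an all-terminal table, or the new
         * table - never a mix of old and new that could form a cycle. */
        static void update_demotion_order(void)
        {
                disable_all_migrate_targets();  /* every entry := NUMA_NO_NODE */
                synchronize_rcu();
                __set_migration_target_nodes(); /* build the new ordering */
        }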
Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "userfaultfd: minor bug fixes".
Three unrelated bug fixes. The first two address possible issues (not
too theoretical ones), but I did not encounter them in practice.
The third patch addresses a test bug that causes the test to fail on my
system. It has been sent before as part of a bigger RFC.
This patch (of 3):
mmap_changing is currently a boolean variable, which is set and cleared
without any lock that protects against concurrent modifications.
mmap_changing is supposed to mark whether userfaultfd page-faults handling
should be retried since mappings are undergoing a change. However,
concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
mmap_changing to be false, although the remove event was still not read
(hence acknowledged) by the user.
Change mmap_changing to atomic_t and increase/decrease it appropriately. Add
a debug assertion to catch mmap_changing going negative.
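A hedged sketch of the resulting shape; the helper names are illustrative
and the real call sites live in the userfaultfd event and copy paths:

        struct userfaultfd_ctx_sketch {                 /* illustrative */
                atomic_t mmap_changing;                 /* was: bool */
        };

        /* Event producer, e.g. a MADV_DONTNEED remove notification: */
        static void uffd_mmap_change_start(struct userfaultfd_ctx_sketch *ctx)
        {
                atomic_inc(&ctx->mmap_changing);
        }

        /* Once userspace has read (acknowledged) the event: */
        static void uffd_mmap_change_end(struct userfaultfd_ctx_sketch *ctx)
        {
                atomic_dec(&ctx->mmap_changing);
                /* Debug assertion: the counter must never go negative. */
                VM_WARN_ON(atomic_read(&ctx->mmap_changing) < 0);
        }

        /* Page-fault handling / copy paths retry while changes are in flight: */
        static bool uffd_must_retry(struct userfaultfd_ctx_sketch *ctx)
        {
                return atomic_read(&ctx->mmap_changing) != 0;
        }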
Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com
Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guillaume Morin reported hitting the following WARNING followed by a GPF or
NULL pointer dereference either in cgroups_destroy or in the kill_css path:
percpu ref (css_release) <= 0 (-1) after switching to atomic
WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130
CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G O 5.10.60 #1
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130
Call Trace:
rcu_core+0x30f/0x530
rcu_core_si+0xe/0x10
__do_softirq+0x103/0x2a2
run_ksoftirqd+0x2b/0x40
smpboot_thread_fn+0x11a/0x170
kthread+0x10a/0x140
ret_from_fork+0x22/0x30
Upon further examination, it was discovered that the css structure was
associated with hugetlb reservations.
For private hugetlb mappings the vma points to a reserve map that
contains a pointer to the css. At mmap time, reservations are set up
and a reference to the css is taken. This reference is dropped in the
vma close operation, hugetlb_vm_op_close. However, if a vma is split, no
additional reference to the css is taken yet hugetlb_vm_op_close will be
called twice for the split vma resulting in an underflow.
Fix by taking another reference in hugetlb_vm_op_open. Note that the
reference is only taken for the owner of the reserve map. In the more
common fork case, the pointer to the reserve map is cleared for
non-owning vmas.
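A hedged sketch of the fix as described; the helper that duplicates the css
reference is named illustratively, and only the reserve map owner takes it,
so the extra reference exactly balances the extra hugetlb_vm_op_close() that
a split produces:

        static void hugetlb_vm_op_open(struct vm_area_struct *vma)
        {
                struct resv_map *resv = vma_resv_map(vma);

                /* ... existing open-time work ... */

                /* The owner of the reserve map gets its own css reference, so
                 * each hugetlb_vm_op_close() of a split vma drops one it owns. */
                if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
                        resv_map_dup_hugetlb_cgroup_uncharge_info(resv);
        }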
Link: https://lkml.kernel.org/r/20210830215015.155224-1-mike.kravetz@oracle.com
Fixes: e9fe92ae0cd2 ("hugetlb_cgroup: add reservation accounting for private mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Guillaume Morin <guillaume@morinfr.org>
Suggested-by: Guillaume Morin <guillaume@morinfr.org>
Tested-by: Guillaume Morin <guillaume@morinfr.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>