linux

iv/linux

Go to file

Yang Shi c6a7f445a2 mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA

Patch series "mm: userspace hugepage collapse", v7.

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[5].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	This operation is independent of the system THP sysfs settings,
	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.

	THP allocation may enter direct reclaim and/or compaction.

	When a range spans multiple VMAs, the semantics of the collapse
	over of each VMA is independent from the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.

Current Use-Cases
--------------------------------

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system which might impair services from serving at their full rated
	load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevents
	page sharing and demand paging, both of which increase steady state
	memory footprint.  With MADV_COLLAPSE, we get the best of both
	worlds: Peak upfront performance and lower RAM footprints.  Note
	that subsequent support for file-backed memory is required here.

(2)	malloc() implementations that manage memory in hugepage-sized
	chunks, but sometimes subrelease memory back to the system in
	native-sized chunks via MADV_DONTNEED; zapping the pmd.  Later,
	when the memory is hot, the implementation could
	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
	hugepage coverage and dTLB performance.  TCMalloc is such an
	implementation that could benefit from this[6].  A prior study of
	Google internal workloads during evaluation of Temeraire, a
	hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
	all cpu cycles were spent in dTLB stalls, and that increasing
	hugepage coverage by even small amount can help with that[7].

(3)	userfaultfd-based live migration of virtual machines satisfy UFFD
	faults by fetching native-sized pages over the network (to avoid
	latency of transferring an entire hugepage).  However, after guest
	memory has been fully copied to the new host, MADV_COLLAPSE can
	be used to immediately increase guest performance.  Note that
	subsequent support for file/shmem-backed memory is required here.

(4)	HugeTLB high-granularity mapping allows HugeTLB a HugeTLB page to
	be mapped at different levels in the page tables[8].  As it's not
	"transparent" like THP, HugeTLB high-granularity mappings require
	an explicit user API. It is intended that MADV_COLLAPSE be co-opted
	for this use case[9].  Note that subsequent support for HugeTLB
	memory is required here.

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[10].


This patch (of 17):

The khugepaged has optimization to reduce huge page allocation calls for
!CONFIG_NUMA by carrying the allocated but failed to collapse huge page to
the next loop.  CONFIG_NUMA doesn't do so since the next loop may try to
collapse huge page from a different node, so it doesn't make too much
sense to carry it.

But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, so it means huge page may be allocated
even though there is no suitable range for collapsing.  Then the page
would be just freed if khugepaged already made enough progress.  This
could make NUMA=n run have 5 times as much thp_collapse_alloc as NUMA=y
run.  This problem actually makes things worse due to the way more
pointless THP allocations and makes the optimization pointless.

This could be fixed by carrying the huge page across scans, but it will
complicate the code further and the huge page may be carried indefinitely.
But if we take one step back, the optimization itself seems not worth
keeping nowadays since:

  * Not too many users build NUMA=n kernel nowadays even though the kernel is
    actually running on a non-NUMA machine. Some small devices may run NUMA=n
    kernel, but I don't think they actually use THP.
  * Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
    stored on the per-cpu lists"), THP could be cached by pcp.  This actually
    somehow does the job done by the optimization.

Link: https://lkml.kernel.org/r/20220706235936.2197195-1-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220706235936.2197195-3-zokeefe@google.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Co-developed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-09-11 20:25:44 -07:00

arch

Misc fixes:

2022-08-28 10:10:23 -07:00

block

block-6.0-2022-08-26

2022-08-26 11:05:54 -07:00

certs

Kbuild updates for v5.20

2022-08-10 10:40:41 -07:00

crypto

crypto: blake2b: effectively disable frame size warning

2022-08-10 17:59:11 -07:00

Documentation

Misc fixes:

2022-08-28 10:10:23 -07:00

drivers

Seventeen hotfixes. Mostly memory management things. Ten patches are

2022-08-28 14:49:59 -07:00

Seventeen hotfixes. Mostly memory management things. Ten patches are

2022-08-28 14:49:59 -07:00

include

Seventeen hotfixes. Mostly memory management things. Ten patches are

2022-08-28 14:49:59 -07:00

init

arm64 fixes for -rc3

2022-08-26 11:32:53 -07:00

io_uring

io_uring/net: save address for sendzc async execution

2022-08-25 07:52:30 -06:00

ipc

Updates to various subsystems which I help look after. lib, ocfs2,

2022-08-07 10:03:24 -07:00

kernel

Seventeen hotfixes. Mostly memory management things. Ten patches are

2022-08-28 14:49:59 -07:00

lib

bitmap fixes for v6.0-rc3

2022-08-28 14:36:27 -07:00

LICENSES

LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers

2021-12-16 14:33:10 +01:00

mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA

2022-09-11 20:25:44 -07:00

net

Including fixes from ipsec and netfilter (with one broken Fixes tag).

2022-08-25 14:03:58 -07:00

samples

Tracing updates for 5.20 / 6.0

2022-08-05 09:41:12 -07:00

scripts

asm goto: eradicate CC_HAS_ASM_GOTO

2022-08-21 10:06:28 -07:00

security

hardening fixes for v6.0-rc2

2022-08-19 13:56:14 -07:00

sound

sound fixes for 6.0-rc2

2022-08-19 09:46:11 -07:00

tools

Misc fixes:

2022-08-28 10:10:23 -07:00

usr

Not a lot of material this cycle. Many singleton patches against various

2022-05-27 11:22:03 -07:00

virt

KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()

2022-08-19 04:05:43 -04:00

.clang-format

PCI/DOE: Add DOE mailbox support functions

2022-07-19 15:38:04 -07:00

.cocciconfig

…

.get_maintainer.ignore

get_maintainer: add Alan to .get_maintainer.ignore

2022-08-20 15:17:44 -07:00

.gitattributes

.gitattributes: use 'dts' diff driver for dts files

2019-12-04 19:44:11 -08:00

.gitignore

kbuild: split the second line of *.mod into *.usyms

2022-05-08 03:16:59 +09:00

.mailmap

.mailmap: update Luca Ceresoli's e-mail address

2022-08-28 14:02:46 -07:00

COPYING

COPYING: state that all contributions really are covered by this file

2020-02-10 13:32:20 -08:00

CREDITS

drm for 5.20/6.0

2022-08-03 19:52:08 -07:00

Kbuild

kbuild: rename hostprogs-y/always to hostprogs/always-y

2020-02-04 01:53:07 +09:00

Kconfig

kbuild: ensure full rebuild when the compiler is updated

2020-05-12 13:28:33 +09:00

MAINTAINERS

bitmap fixes for v6.0-rc3

2022-08-28 14:36:27 -07:00

Makefile

Linux 6.0-rc3

2022-08-28 15:05:29 -07:00

README

Drop all 00-INDEX files from Documentation/

2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.

Languages

C 97.6%

Assembly 1%

Shell 0.5%

Python 0.3%

Makefile 0.3%