IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
It it inconvenient to mention the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others since there
is no specific or abbreviated name for it when it is first introduced.
Let us give it a name HVO (HugeTLB Vmemmap Optimization) from now.
This commit also updates the document about "hugetlb_free_vmemmap" by the
way discussed in thread [1].
Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize". In this patch , cheanup configs to make code more
expressive.
Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can specify the number of hugepages to allocate at boot. But the
hugepages is balanced in all nodes at present. In some scenarios, we
only need hugepages in one node. For example: DPDK needs hugepages
which are in the same node as NIC.
If DPDK needs four hugepages of 1G size in node1 and system has 16 numa
nodes we must reserve 64 hugepages on the kernel cmdline. But only four
hugepages are used. The others should be free after boot. If the
system memory is low(for example: 64G), it will be an impossible task.
So extend the hugepages parameter to support specifying hugepages on a
specific node. For example add following parameter:
hugepagesz=1G hugepages=0:1,1:3
It will allocate 1 hugepage in node0 and 3 hugepages in node1.
Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "hugetlb: add demote/split page functionality", v4.
The concurrent use of multiple hugetlb page sizes on a single system is
becoming more common. One of the reasons is better TLB support for
gigantic page sizes on x86 hardware. In addition, hugetlb pages are
being used to back VMs in hosting environments.
When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools. This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup. In addition, preallocating huge
pages minimizes the issue of memory fragmentation that increases the
longer the system is up and running.
In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes. Over
time, the preallocated pool of smaller hugetlb pages may become depleted
while larger hugetlb pages still remain. In such situations, it is
desirable to convert larger hugetlb pages to smaller hugetlb pages.
Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages. For example, to convert 50 GB pages on x86:
gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
On an idle system this operation is fairly reliable and results are as
expected. The number of 2MB pages is increased as expected and the time
of the operation is a second or two.
However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
will match the size of the larger page which was freed. This is
because the area freed by the larger page could quickly be
fragmented.
In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:
Unexpected number of 2MB pages allocated: Expected 25600, have 19944
real 0m42.092s
user 0m0.008s
sys 0m41.467s
To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size. This avoids freeing pages to buddy and then
trying to allocate from buddy.
Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.
- demote_size
Target page size for demotion, a smaller huge page size. File
can be written to chose a smaller huge page size if multiple are
available.
- demote
Writable number of hugetlb pages to be demoted
To demote 50 GB huge pages, one would:
cat .../hugepages-1048576kB/free_hugepages /* optional, verify free pages */
cat .../hugepages-1048576kB/demote_size /* optional, verify target size */
echo 50 > .../hugepages-1048576kB/demote
Only hugetlb pages which are free at the time of the request can be
demoted. Demotion does not add to the complexity of surplus pages and
honors reserved huge pages. Therefore, when a value is written to the
sysfs demote file, that value is only the maximum number of pages which
will be demoted. It is possible fewer will actually be demoted. The
recently introduced per-hstate mutex is used to synchronize demote
operations with other operations that modify hugetlb pools.
Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages
are used to back VMs on x86. Both issues of long allocation times and
not necessarily getting the expected number of smaller huge pages after
a free and allocate cycle have been experienced. The occurrence of
these issues is dependent on other activity within the host and can not
be predicted.
This patch (of 5):
Two new sysfs files are added to demote hugtlb pages. These files are
both per-hugetlb page size and per node. Files are:
demote_size - The size in Kb that pages are demoted to. (read-write)
demote - The number of huge pages to demote. (write-only)
By default, demote_size is the next smallest huge page size. Valid huge
page sizes less than huge page size may be written to this file. When
huge pages are demoted, they are demoted to this size.
Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.
NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size. For example, on x86 1GB huge
pages will have demote interfaces. 2MB huge pages will not have demote
interfaces.
This patch does not provide full demote functionality. It only provides
the sysfs interfaces.
It also provides documentation for the new interfaces.
[mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
freeing unused vmemmap pages associated with each hugetlb page on boot.
We disable PMD mapping of vmemmap pages for x86-64 arch when this feature
is enabled. Because vmemmap_remap_free() depends on vmemmap being base
page mapped.
Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it. However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure. In
this case, we just refuse to free the HugeTLB page. This changes behavior
in some corner cases as listed below:
1) Failing to free a huge page triggered by the user (decrease nr_pages).
User needs to try again later.
2) Failing to free a surplus huge page when freed by the application.
Try again later when freeing a huge page next time.
3) Failing to dissolve a free huge page on ZONE_MOVABLE via
offline_pages().
This can happen when we have plenty of ZONE_MOVABLE memory, but
not enough kernel memory to allocate vmemmmap pages. We may even
be able to migrate huge page contents, but will not be able to
dissolve the source huge page. This will prevent an offline
operation and is unfortunate as memory offlining is expected to
succeed on movable zones. Users that depend on memory hotplug
to succeed for movable zones should carefully consider whether the
memory savings gained from this feature are worth the risk of
possibly not being able to offline memory in certain situations.
4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
alloc_contig_range() - once we have that handling in place. Mainly
affects CMA and virtio-mem.
Similar to 3). virito-mem will handle migration errors gracefully.
CMA might be able to fallback on other free areas within the CMA
region.
Vmemmap pages are allocated from the page freeing context. In order for
those allocations to be not disruptive (e.g. trigger oom killer)
__GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure. GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.
[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some new command line parameters were added at hugetlbpage.rst.
Adjust them in order to properly parse that part of the file,
avoiding those warnings:
Documentation/admin-guide/mm/hugetlbpage.rst:105: WARNING: Unexpected indentation.
Documentation/admin-guide/mm/hugetlbpage.rst:108: WARNING: Unexpected indentation.
Documentation/admin-guide/mm/hugetlbpage.rst:109: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/admin-guide/mm/hugetlbpage.rst:112: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/admin-guide/mm/hugetlbpage.rst:120: WARNING: Unexpected indentation.
Documentation/admin-guide/mm/hugetlbpage.rst:121: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/admin-guide/mm/hugetlbpage.rst:132: WARNING: Unexpected indentation.
Documentation/admin-guide/mm/hugetlbpage.rst:135: WARNING: Block quote ends without a blank line; unexpected unindent.
Fixes: cd9fa28b5351 ("hugetlbfs: clean up command line processing")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/86b6796b1a84e18b24314ecd29318951c1479ca2.1592895969.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
With all hugetlb page processing done in a single file clean up code.
- Make code match desired semantics
- Update documentation with semantics
- Make all warnings and errors messages start with 'HugeTLB:'.
- Consistently name command line parsing routines.
- Warn if !hugepages_supported() and command line parameters have
been specified.
- Add comments to code
- Describe some of the subtle interactions
- Describe semantics of command line arguments
This patch also fixes issues with implicitly setting the number of
gigantic huge pages to preallocate. Previously on X86 command line,
hugepages=2 default_hugepagesz=1G
would result in zero 1G pages being preallocated and,
# grep HugePages_Total /proc/meminfo
HugePages_Total: 0
# sysctl -a | grep nr_hugepages
vm.nr_hugepages = 2
vm.nr_hugepages_mempolicy = 2
# cat /proc/sys/vm/nr_hugepages
2
After this patch 2 gigantic pages will be preallocated and all the proc,
sysfs, sysctl and meminfo files will accurately reflect this.
To address the issue with gigantic pages, a small change in behavior was
made to command line processing. Previously the command line,
hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256
would result in the allocation of 256 2M huge pages. The value 128 would
be ignored without any warning. After this patch, 128 2M pages will be
allocated and a warning message will be displayed indicating the value of
256 is ignored. This change in behavior is required because allocation of
implicitly specified gigantic pages must be done when the
default_hugepagesz= is encountered for gigantic pages. Previously the
code waited until later in the boot process (hugetlb_init), to allocate
pages of default size. However the bootmem allocator required for
gigantic allocations is not available at this time.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sandipan Das <sandipan@linux.ibm.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Longpeng <longpeng2@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The document describes userspace API and as such it belongs to
Documentation/admin-guide/mm
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Several documents in Documentation/vm fit quite well into the "admin/user
guide" category. The documents that don't overload the reader with lots of
implementation details and provide coherent description of certain feature
can be moved to Documentation/admin-guide/mm.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>