There are several places that allocate memory for the memory map:
alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and
__populate_section_memmap() for SPARSEMEM.
The memory allocated in the FLATMEM case is zeroed and it is never
poisoned, regardless of CONFIG_PAGE_POISON setting.
The memory allocated in the SPARSEMEM cases is not zeroed and it is
implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set.
Introduce a memmap_alloc() wrapper for the memblock allocators that will
be used for both the FLATMEM and SPARSEMEM cases and will make memory
map zeroing and poisoning consistent across memory models.
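A minimal sketch of what such a wrapper might look like (the exact_nid
parameter and the use of the raw memblock allocators are assumptions for
illustration, not the literal patch):

static void * __init memmap_alloc(phys_addr_t size, phys_addr_t align,
                                  phys_addr_t min_addr, int nid,
                                  bool exact_nid)
{
        /*
         * The *_raw variants neither zero nor poison the memory
         * themselves; memblock poisons the range when CONFIG_PAGE_POISON
         * is set, so FLATMEM and SPARSEMEM behave the same way.
         */
        if (exact_nid)
                return memblock_alloc_exact_nid_raw(size, align, min_addr,
                                                    MEMBLOCK_ALLOC_ACCESSIBLE,
                                                    nid);
        return memblock_alloc_try_nid_raw(size, align, min_addr,
                                          MEMBLOCK_ALLOC_ACCESSIBLE, nid);
}

Both alloc_node_mem_map() for FLATMEM and the SPARSEMEM allocators can
then funnel through this single helper.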
Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Free some vmemmap pages of HugeTLB page", v23.
This patch series frees some vmemmap pages (struct page structures)
associated with each HugeTLB page when it is preallocated, to save
memory.
In order to reduce the difficulty of code review for the first version,
this version disables the PMD/huge page mapping of the vmemmap when this
feature is enabled. This eliminates a bunch of the complex code doing
page table manipulation. Once this patch series is solid, we can add the
vmemmap page table manipulation code in the future.
The struct page structures (page structs) are used to describe a
physical page frame. By default, there is a one-to-one mapping from a
page frame to its corresponding page struct.
HugeTLB pages consist of multiple base page size pages and are
supported by many architectures. See hugetlbpage.rst in the
Documentation directory for more details. On the x86 architecture,
HugeTLB pages of size 2MB and 1GB are currently supported. Since the
base page size on x86 is 4KB, a 2MB HugeTLB page consists of 512 base
pages and a 1GB HugeTLB page consists of 262144 base pages. For each
base page, there is a corresponding page struct.
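As a quick sanity check of these numbers, here is a small stand-alone
program; the 64-byte sizeof(struct page) is an assumption for
illustration, the 4KB base page size follows the text:

#include <stdio.h>

int main(void)
{
        const unsigned long base_page = 4096;      /* 4KB base page on x86 */
        const unsigned long page_struct = 64;      /* assumed sizeof(struct page) */
        const unsigned long huge2m = 2UL << 20;
        const unsigned long huge1g = 1UL << 30;

        /* base pages per HugeTLB page, and pages of memmap describing them */
        printf("2MB: %lu base pages, %lu memmap pages\n",
               huge2m / base_page,
               huge2m / base_page * page_struct / base_page);
        printf("1GB: %lu base pages, %lu memmap pages\n",
               huge1g / base_page,
               huge1g / base_page * page_struct / base_page);
        return 0;
}

This prints 512 base pages / 8 memmap pages for 2MB, and 262144 base
pages / 4096 memmap pages for 1GB, matching the figures used below.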
Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
provides this upper limit. The only 'useful' information in the remaining
page structs is the compound_head field, and this field is the same for
all tail pages.
By removing redundant page structs for HugeTLB pages, memory can be
returned to the buddy allocator for other uses.
When the system boots up, every 2MB HugeTLB page has 512 struct page
structs whose size is 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
HugeTLB struct pages(8 pages) page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
| | | 1 | -------------> | 1 |
| | +-----------+ +-----------+
| | | 2 | -------------> | 2 |
| | +-----------+ +-----------+
| | | 3 | -------------> | 3 |
| | +-----------+ +-----------+
| | | 4 | -------------> | 4 |
| 2MB | +-----------+ +-----------+
| | | 5 | -------------> | 5 |
| | +-----------+ +-----------+
| | | 6 | -------------> | 6 |
| | +-----------+ +-----------+
| | | 7 | -------------> | 7 |
| | +-----------+ +-----------+
| |
| |
| |
+-----------+
The value of page->compound_head is the same for all tail pages. The
first page of page structs (page 0) associated with the HugeTLB page
contains the 4 page structs necessary to describe the HugeTLB. The only
use of the remaining pages of page structs (page 1 to page 7) is to point
to page->compound_head. Therefore, we can remap pages 2 to 7 to page 1.
Only 2 pages of page structs will be used for each HugeTLB page. This
will allow us to free the remaining 6 pages to the buddy allocator.
Here is how things look after remapping.
HugeTLB struct pages(8 pages) page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
| | | 1 | -------------> | 1 |
| | +-----------+ +-----------+
| | | 2 | ----------------^ ^ ^ ^ ^ ^
| | +-----------+ | | | | |
| | | 3 | ------------------+ | | | |
| | +-----------+ | | | |
| | | 4 | --------------------+ | | |
| 2MB | +-----------+ | | |
| | | 5 | ----------------------+ | |
| | +-----------+ | |
| | | 6 | ------------------------+ |
| | +-----------+ |
| | | 7 | --------------------------+
| | +-----------+
| |
| |
| |
+-----------+
When a HugeTLB is freed to the buddy system, we should allocate 6 pages
for vmemmap pages and restore the previous mapping relationship.
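The core trick is simply "many virtual pages aliasing one physical
page". A loose user-space analogy (not kernel code; memfd_create() and
the 8-page layout are only for illustration - pages 1 through 7 of the
toy mapping end up backed by a single physical page, roughly like
vmemmap pages 2 to 7 being remapped onto page 1):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        int fd = memfd_create("tail-page", 0);

        if (fd < 0 || ftruncate(fd, psz) != 0)
                return 1;

        /* 8 contiguous virtual pages, like the 8 vmemmap pages of a 2MB HugeTLB */
        char *area = mmap(NULL, 8 * psz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* remap pages 1..7 so they all alias the single shared physical page */
        for (int i = 1; i < 8; i++)
                mmap(area + i * psz, psz, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_FIXED, fd, 0);

        strcpy(area + 1 * psz, "shared tail data");
        printf("page 7 sees: %s\n", area + 7 * psz); /* same backing page */
        return 0;
}

In the kernel the same effect is achieved by rewriting the vmemmap page
table entries rather than by mmap().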
Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page. It
is similar to the 2MB HugeTLB page, and we can use the same approach to
free its vmemmap pages.
In this case, for a 1GB HugeTLB page, we can save 4094 pages. This is a
very substantial gain. On our servers we run some SPDK/QEMU applications
which use 1024GB of HugeTLB pages. With this feature enabled, we can
save ~16GB (1GB hugepages) / ~12GB (2MB hugepages) of memory.
Because the vmemmap page tables are reconstructed on the
freeing/allocating path, this adds some overhead. Here is some overhead
analysis.
1) Allocating 10240 2MB HugeTLB pages.
a) With this patch series applied:
# time echo 10240 > /proc/sys/vm/nr_hugepages
real 0m0.166s
user 0m0.000s
sys 0m0.166s
# bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
@start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[8K, 16K) 5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 4 | |
b) Without this patch series:
# time echo 10240 > /proc/sys/vm/nr_hugepages
real 0m0.067s
user 0m0.000s
sys 0m0.067s
# bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
@start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[4K, 8K) 10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 93 | |
Summary: this feature is about 2x slower than before.
2) Freeing 10240 2MB HugeTLB pages.
a) With this patch series applied:
# time echo 0 > /proc/sys/vm/nr_hugepages
real 0m0.213s
user 0m0.000s
sys 0m0.213s
# bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
@start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[8K, 16K) 6 | |
[16K, 32K) 10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 7 | |
b) Without this patch series:
# time echo 0 > /proc/sys/vm/nr_hugepages
real 0m0.081s
user 0m0.000s
sys 0m0.081s
# bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
@start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[4K, 8K) 6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 8 | |
Summary: __free_hugepage() is about 2-3x slower than before. Although
the overhead has increased, it is not significant.
Like Mike said, "However, remember that the majority of use cases create
HugeTLB pages at or shortly after boot time and add them to the pool. So,
additional overhead is at pool creation time. There is no change to
'normal run time' operations of getting a page from or returning a page to
the pool (think page fault/unmap)".
Despite the overhead, and in addition to the memory gains from this
series, there is a further benefit: page (un)pinners will see an
improvement, which Joao presumes is because there are fewer memmap pages
and thus the tail/head pages stay in cache more often. The following
data was obtained by Joao Martins; many thanks for his effort.
Out of the box Joao saw (when comparing linux-next against linux-next +
this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):
get_user_pages(): ~32k -> ~9k
unpin_user_pages(): ~75k -> ~70k
Usually any tight loop fetching compound_head(), or reading tail page
data (e.g. compound_head), benefits a lot. There were some unpinning
inefficiencies Joao was fixing[2], and with that fix added it shows even
more:
unpin_user_pages(): ~27k -> ~3.8k
[1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/
This patch (of 9):
Move the common bootmem info registration API into its own
bootmem_info.c. Later patches will use {get,put}_page_bootmem() to
initialize the pages for the vmemmap pages or free the vmemmap pages to
the buddy allocator, so move these helpers out of
CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any
functional change.
Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I see a "virt_to_phys used for non-linear address" warning from
check_usemap_section_nr() on arm64 platforms.
In current implementation of NODE_DATA, if CONFIG_NEED_MULTIPLE_NODES=y,
pglist_data is dynamically allocated and assigned to node_data[].
For example, in arch/arm64/include/asm/mmzone.h:
extern struct pglist_data *node_data[];
#define NODE_DATA(nid) (node_data[(nid)])
If CONFIG_NEED_MULTIPLE_NODES=n, pglist_data is defined as a global
variable named "contig_page_data".
For example, in include/linux/mmzone.h:
extern struct pglist_data contig_page_data;
#define NODE_DATA(nid) (&contig_page_data)
If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both
dynamically allocated linear addresses and symbol addresses. However,
if CONFIG_DEBUG_VIRTUAL=y and CONFIG_NEED_MULTIPLE_NODES=n, we see the
"virt_to_phys used for non-linear address" warning because
&contig_page_data is not a linear address on arm64.
Warning message:
virt_to_phys used for non-linear address: (contig_page_data+0x0/0x1c00)
WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x58/0x68
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Tainted: G W 5.13.0-rc1-00074-g1140ab592e2e #3
Hardware name: linux,dummy-virt (DT)
pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
Call trace:
__virt_to_phys+0x58/0x68
check_usemap_section_nr+0x50/0xfc
sparse_init_nid+0x1ac/0x28c
sparse_init+0x1c4/0x1e0
bootmem_init+0x60/0x90
setup_arch+0x184/0x1f0
start_kernel+0x78/0x488
To fix it, create a small function to handle both translations.
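A minimal sketch of such a helper (the name pgdat_to_phys() and its
exact placement are illustrative):

static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
{
#ifdef CONFIG_NEED_MULTIPLE_NODES
        /* node_data[] entries are dynamically allocated, linear addresses */
        return __pa(pgdat);
#else
        /* &contig_page_data is a kernel symbol, not a linear address */
        return __pa_symbol(pgdat);
#endif
}

check_usemap_section_nr() can then use this helper instead of calling
__pa() directly.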
Link: https://lkml.kernel.org/r/1623058729-27264-1-git-send-email-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Kazu <k-hagio-ab@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) existing memory is consumed for that purpose
   (e.g. ~2MB per 128MB memory section on x86_64)
   This can even lead to extreme cases where the system goes OOM because
   the physically hotplugged memory depletes the available memory before
   it is onlined.
b) if the whole node is movable then we have off-node struct pages,
   which has performance drawbacks.
c) it might be that there are no PMD_ALIGNED chunks, so the memmap array
   gets populated with base pages.
This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemmap page tables can map arbitrary memory. That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables. This implementation uses the beginning of the hotplugged memory
for that purpose.
There are some non-obvious things to consider, though.
Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed. This means that the reserved physical range is not
online although it is used. The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns. The current design
expects that this should be OK as the hotplugged memory is considered
garbage until it is onlined. For example, hibernation wouldn't save the
content of those vmemmaps into the image, so it wouldn't be restored on
resume, but this should be OK as there is no real content to recover
anyway while the metadata is reachable from other data structures
(e.g. vmemmap page tables).
The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory). That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special case them in a single function.
As per above, the functions that are introduced are:
- mhp_init_memmap_on_memory:
Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
fully span.
- mhp_deinit_memmap_on_memory:
Offlines as many sections as vmemmap pages fully span, removes the
range from the zone by remove_pfn_range_from_zone(), and calls
kasan_remove_zero_shadow() for the range.
The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages(). Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
present_pages is done at the end once we know that online_pages()
succeeded.
On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages before calling offline_pages(). This is necessary because
offline_pages() tears down some structures based on whether the node or
the zone becomes empty. If offline_pages() fails, we account the vmemmap
pages back. If it succeeds, we call mhp_deinit_memmap_on_memory().
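Put together, the online path described above looks roughly like the
following control-flow sketch; the signatures of online_pages() and the
present_pages accounting helper are simplified assumptions here:

static int memory_block_online(struct memory_block *mem)
{
        unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
        unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
        unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
        int ret;

        /* initialize the vmemmap pages living at the start of the block */
        if (nr_vmemmap_pages) {
                ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages);
                if (ret)
                        return ret;
        }

        /* online everything but the vmemmap pages themselves */
        ret = online_pages(start_pfn + nr_vmemmap_pages,
                           nr_pages - nr_vmemmap_pages, mem->nid);
        if (ret) {
                if (nr_vmemmap_pages)
                        mhp_deinit_memmap_on_memory(start_pfn,
                                                    nr_vmemmap_pages);
                return ret;
        }

        /* adjust present_pages only once online_pages() has succeeded */
        if (nr_vmemmap_pages)
                adjust_present_page_count(pfn_to_page(start_pfn),
                                          nr_vmemmap_pages);

        return 0;
}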
Hot-remove:
We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.
Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two
equivalent functions that call memory_present() for each region in
memblock.memory: sparse_memory_present_with_active_regions() and
memblocks_present().
Moreover, all architectures have a call to either of these functions
preceding the call to sparse_init() and in most cases they are called
one after the other.
Mark the regions from memblock.memory as present during sparse_init() by
making sparse_init() call memblocks_present(), make the
memblocks_present() and memory_present() functions static and remove the
redundant sparse_memory_present_with_active_regions() function.
Also remove the no longer required HAVE_MEMORY_PRESENT configuration
option.
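After the change, the start of sparse_init() would look roughly like
this (sketch):

void __init sparse_init(void)
{
        /*
         * Mark all memblock.memory regions present up front, so that
         * architectures no longer need a separate memory_present() or
         * sparse_memory_present_with_active_regions() call.
         */
        memblocks_present();

        /* ... the existing per-node memmap population follows ... */
}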
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For early sections, the memmap is handled specially even when
sub-sections are enabled: the memmap can only be populated as a whole.
Quoting the comment of section_activate():
* The early init code does not consider partially populated
* initial sections, it simply assumes that memory will never be
* referenced. If we hot-add memory into such a section then we
* do not need to populate the memmap and can simply reuse what
* is already there.
However, the current section_deactivate() breaks this rule. When
hot-removing a sub-section, section_deactivate() depopulates its memmap.
The consequence is that if we hot-add this sub-section again, its memmap
never gets properly populated.
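A sketch of the kind of guard that keeps this rule intact (reconstructed
for illustration, not the literal patch): section_deactivate() should
only depopulate the memmap of non-early sections, and should only free a
bootmem-backed memmap when the early section goes away entirely:

        /* in section_deactivate(), once the subsection map has been updated */
        if (!section_is_early)
                depopulate_section_memmap(pfn, nr_pages, altmap);
        else if (memmap)
                free_map_bootmem(memmap);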
We can reproduce the case with the following steps:
1. Hacking qemu to allow sub-section early section
: diff --git a/hw/i386/pc.c b/hw/i386/pc.c
: index 51b3050d01..c6a78d83c0 100644
: --- a/hw/i386/pc.c
: +++ b/hw/i386/pc.c
: @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms,
: }
:
: machine->device_memory->base =
: - ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
: + 0x100000000ULL + x86ms->above_4g_mem_size;
:
: if (pcmc->enforce_aligned_dimm) {
: /* size device region assuming 1G page max alignment per slot */
2. Bootup qemu with PSE disabled and a sub-section aligned memory size
Part of the qemu command would look like this:
sudo x86_64-softmmu/qemu-system-x86_64 \
--enable-kvm -cpu host,pse=off \
-m 4160M,maxmem=20G,slots=1 \
-smp sockets=2,cores=16 \
-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
-machine pc,nvdimm \
-nographic \
-object memory-backend-ram,id=mem0,size=8G \
-device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k
3. Re-config a pmem device with sub-section size in guest
ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
Then you would see the following call trace:
pmem0: detected capacity change from 0 to 16777216
BUG: unable to handle page fault for address: ffffec73c51000b4
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0
Oops: 0002 [#1] SMP NOPTI
CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
RIP: 0010:memmap_init_zone+0x154/0x1c2
Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f
RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282
RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000
RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080
RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708
R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000
FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0
Call Trace:
move_pfn_range_to_zone+0x128/0x150
memremap_pages+0x4e4/0x5a0
devm_memremap_pages+0x1e/0x60
dev_dax_probe+0x69/0x160 [device_dax]
really_probe+0x298/0x3c0
driver_probe_device+0xe1/0x150
? driver_allows_async_probing+0x50/0x50
bus_for_each_drv+0x7e/0xc0
__device_attach+0xdf/0x160
bus_probe_device+0x8e/0xa0
device_add+0x3b9/0x740
__devm_create_dev_dax+0x127/0x1c0
__dax_pmem_probe+0x1f2/0x219 [dax_pmem_core]
dax_pmem_probe+0xc/0x1b [dax_pmem]
nvdimm_bus_probe+0x69/0x1c0 [libnvdimm]
really_probe+0x147/0x3c0
driver_probe_device+0xe1/0x150
device_driver_attach+0x53/0x60
bind_store+0xd1/0x110
kernfs_fop_write+0xce/0x1b0
vfs_write+0xb6/0x1a0
ksys_write+0x5f/0xe0
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: ba72b4c8cf ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/20200625223534.18024-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: cleanup usage of <asm/pgalloc.h>"
Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.
In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>
In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.
This patch (of 8):
In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.
As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.
The process was somewhat automated using
sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))
where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.
[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, to support subsection-aligned memory region adding for pmem,
a subsection map is added to track which subsections are present.
However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. This means the
subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For
the classic sparse, it's meaningless. Even worse, it may confuse people
when checking code related to the classic sparse.
About the classic sparse which doesn't support subsection hotplug, Dan
said it's more because the effort and maintenance burden outweighs the
benefit. Besides, the current 64 bit ARCHes all enable
SPARSEMEM_VMEMMAP_ENABLE by default.
Combining the above reasons, there is no need to provide the subsection
map and the relevant handling for the classic sparse. Let's remove them.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.
Memory sub-section hotplug was added to fix the issue that nvdimm could be
mapped at non-section aligned starting address. A subsection map is added
into struct mem_section_usage to implement it.
However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. This means the
subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For
the classic sparse, the subsection map is meaningless and confusing.
About the classic sparse which doesn't support subsection hotplug, Dan
said it's more because the effort and maintenance burden outweighs the
benefit. Besides, the current 64 bit ARCHes all enable
SPARSEMEM_VMEMMAP_ENABLE by default.
This patch (of 5):
Factor out the code that fills the subsection map from section_activate()
into fill_subsection_map(). This makes section_activate() cleaner and
easier to follow.
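A sketch of the factored-out helper, reconstructed from the description
above (details of the real function may differ):

static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
{
        struct mem_section *ms = __pfn_to_section(pfn);
        DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
        unsigned long *subsection_map = ms->usage->subsection_map;

        subsection_mask_set(map, pfn, nr_pages);

        if (bitmap_empty(map, SUBSECTIONS_PER_SECTION))
                return -EINVAL;
        /* a collision means some of these subsections are already active */
        if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION))
                return -EEXIST;

        bitmap_or(subsection_map, map, subsection_map,
                  SUBSECTIONS_PER_SECTION);
        return 0;
}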
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When allocating the memmap for hot-added memory with the classic sparse,
the specified 'nid' is ignored in populate_section_memmap().
When allocating the memmap for the classic sparse during boot, the node
given by 'nid' is preferred. And VMEMMAP prefers the node of 'nid' in
both the boot stage and memory hot adding. So there seems to be no
reason not to respect the node of 'nid' for the classic sparse when
hot-adding memory.
Use kvmalloc_node() instead so that the passed-in 'nid' is used.
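For the classic sparse, the hot-add allocation then boils down to
something like this sketch:

struct page * __meminit populate_section_memmap(unsigned long pfn,
                unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
{
        /* prefer, but do not require, memory from the hot-added node */
        return kvmalloc_node(array_size(sizeof(struct page),
                                        PAGES_PER_SECTION),
                             GFP_KERNEL, nid);
}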
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200316125625.GH3486@MiWiFi-R3L-srv
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the crash like this:
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc000000000c3447c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
...
NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
LR [c000000000088354] vmemmap_free+0x144/0x320
Call Trace:
section_deactivate+0x220/0x240
__remove_pages+0x118/0x170
arch_remove_memory+0x3c/0x150
memunmap_pages+0x1cc/0x2f0
devm_action_release+0x30/0x50
release_nodes+0x2f8/0x3e0
device_release_driver_internal+0x168/0x270
unbind_store+0x130/0x170
drv_attr_store+0x44/0x60
sysfs_kf_write+0x68/0x80
kernfs_fop_write+0x100/0x290
__vfs_write+0x3c/0x70
vfs_write+0xcc/0x240
ksys_write+0x7c/0x140
system_call+0x5c/0x68
The crash is due to a NULL pointer dereference at
test_bit(idx, ms->usage->subsection_map);
caused by ms->usage being NULL in pfn_section_valid().
With commit d41e2f3bd5 ("mm/hotplug: fix hot remove failure in
SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
depopulate_section_memmap(). This was done so that pfn_to_page() can
work correctly with kernel configs that disable SPARSEMEM_VMEMMAP. With
that config pfn_to_page() does
__section_mem_map_addr(__sec) + __pfn;
where
static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
unsigned long map = section->section_mem_map;
map &= SECTION_MAP_MASK;
return (struct page *)map;
}
Now, with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map
is used to check pfn validity (pfn_valid()). Since section_deactivate()
releases mem_section->usage if a section is fully deactivated, a
pfn_valid() check after a subsection deactivation causes a kernel crash.
static inline int pfn_valid(unsigned long pfn)
{
...
return early_section(ms) || pfn_section_valid(ms, pfn);
}
where
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);
return test_bit(idx, ms->usage->subsection_map);
}
Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
freed. For architectures like ppc64 where large pages are used for the
vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple
sections. Hence, before a vmemmap mapping page can be freed, the kernel
needs to make sure there are no valid sections within that mapping.
Clearing the section valid bit before depopulate_section_memmap()
enables this.
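In section_deactivate() this amounts to something like the following
sketch:

        /*
         * Mark the section invalid so that valid_section() returns false
         * and nothing dereferences ms->usage after it has been freed.
         */
        ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;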
[aneesh.kumar@linux.ibm.com: add comment]
Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
Fixes: d41e2f3bd5 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In section_deactivate(), pfn_to_page() doesn't work any more after
ms->section_mem_map is reset to NULL in the SPARSEMEM|!VMEMMAP case.
This causes a hot remove failure:
kernel BUG at mm/page_alloc.c:4806!
invalid opcode: 0000 [#1] SMP PTI
CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:free_pages+0x85/0xa0
Call Trace:
__remove_pages+0x99/0xc0
arch_remove_memory+0x23/0x4d
try_remove_memory+0xc8/0x130
__remove_memory+0xa/0x11
acpi_memory_device_remove+0x72/0x100
acpi_bus_trim+0x55/0x90
acpi_device_hotplug+0x2eb/0x3d0
acpi_hotplug_work_fn+0x1a/0x30
process_one_work+0x1a7/0x370
worker_thread+0x30/0x380
kthread+0x112/0x130
ret_from_fork+0x35/0x40
Let's move the ->section_mem_map resetting after
depopulate_section_memmap() to fix it.
[akpm@linux-foundation.org: remove unneeded initialization, per David]
Fixes: ba72b4c8cf ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit ba72b4c8cf ("mm/sparsemem: support sub-section hotplug"),
when a mem section is fully deactivated, section_mem_map still records
the section's start pfn, which is not used any more and will be
reassigned during re-addition.
In analogy with the alloc/free pattern, it is better to clear all fields
of section_mem_map.
Besides this, it breaks the user space tool "makedumpfile" [1], which
makes the assumption that a hot-removed section has mem_map as NULL,
instead of checking directly against the SECTION_MARKED_PRESENT bit.
(makedumpfile would be better off changing that assumption, which needs
a patch.)
The bug can be reproduced on IBM POWERVM by running "drmgr -c mem -r -q 5",
triggering a crash, and saving the vmcore with makedumpfile.
[1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")
Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we remove an early section, we don't free the usage map, as the
usage maps of other sections are placed into the same page. Once the
section is removed, it is no longer an early section (in particular, the
memmap is freed). When we re-add that section, the usage map is reused,
however, it is no longer an early section. When removing that section
again, we try to kfree() a usage map that was allocated during early
boot - bad.
Let's check against PageReserved() to see if we are dealing with a
usage map that was allocated during boot. We could also check against
!(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
cleaner.
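Sketched out, the check in section_deactivate() looks like this
(illustrative):

        struct page *usage_page = virt_to_page(ms->usage);

        /*
         * Usage maps allocated during boot live in memblock-reserved
         * pages; only usage maps that actually came from the slab
         * allocator may be handed back via kfree().
         */
        if (!PageReserved(usage_page)) {
                kfree(ms->usage);
                ms->usage = NULL;
        }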
Can be triggered using memtrace under ppc64/powernv:
$ mount -t debugfs none /sys/kernel/debug/
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
------------[ cut here ]------------
kernel BUG at mm/slub.c:3969!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
NIP kfree+0x338/0x3b0
LR section_deactivate+0x138/0x200
Call Trace:
section_deactivate+0x138/0x200
__remove_pages+0x114/0x150
arch_remove_memory+0x3c/0x160
try_remove_memory+0x114/0x1a0
__remove_memory+0x20/0x40
memtrace_enable_set+0x254/0x850
simple_attr_write+0x138/0x160
full_proxy_write+0x8c/0x110
__vfs_write+0x38/0x70
vfs_write+0x11c/0x2a0
ksys_write+0x84/0x140
system_call+0x5c/0x68
---[ end trace 4b053cbd84e0db62 ]---
The first invocation will offline+remove memory blocks. The second
invocation will first add+online them again, in order to offline+remove
them again (usually we are lucky and the exact same memory blocks will
get "reallocated").
Tested on powernv with boot memory: The usage map will not get freed.
Tested on x86-64 with DIMMs: The usage map will get freed.
Using Dynamic Memory under a Power DLPAR can trigger it easily.
Triggering removal (I assume after a previous remove+re-add) of memory
from the HMC GUI can crash the kernel with the same call trace and is
fixed by this patch.
Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
Fixes: 326e1b8f83 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse_buffer_init() uses memblock_alloc_try_nid_raw() to allocate
memory for the page management structures; if the allocation from the
specified node fails, it falls back to allocating from other nodes.
Normally, the page management structures will not exceed 2% of the total
memory, but a large contiguous block of memory is needed. In most cases,
the allocation from the specified node will succeed, but it will fail on
a node whose memory has become highly fragmented. In that case we would
rather allocate the memmap section by section on the local node than
allocate a large block of memory from other NUMA nodes.
Add memblock_alloc_exact_nid_raw() for this situation, which allocates a
boot memory block on the exact node. If a large contiguous block
allocation fails in sparse_buffer_init(), it will fall back to
allocating smaller blocks section by section.
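In sparse_buffer_init() this would look roughly like the following
sketch (the __pa(MAX_DMA_ADDRESS) lower bound and PAGE_SIZE alignment
follow the existing code):

static void __init sparse_buffer_init(unsigned long size, int nid)
{
        phys_addr_t addr = __pa(MAX_DMA_ADDRESS);

        /*
         * Pre-allocate one large buffer strictly on the requested node.
         * If this fails, sparse_buffer_alloc() returns NULL and each
         * section falls back to a smaller, node-local allocation.
         */
        sparsemap_buf = memblock_alloc_exact_nid_raw(size, PAGE_SIZE, addr,
                                                     MEMBLOCK_ALLOC_ACCESSIBLE,
                                                     nid);
        sparsemap_buf_end = sparsemap_buf + size;
}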
Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vincent has noticed [1] that there is something unusual with the memmap
allocations going on on his platform
: I noticed this because on my ARM64 platform, with 1 GiB of memory the
: first [and only] section is allocated from the zeroing path while with
: 2 GiB of memory the first 1 GiB section is allocated from the
: non-zeroing path.
The underlying problem is that although sparse_buffer_init() allocates
enough memory for all sections on the node, sparse_buffer_alloc() is not
able to consume it due to a mismatch in the expected allocation
alignment. While the sparse_buffer_init() preallocation uses PAGE_SIZE
alignment, the real memmap has to be aligned to section_map_size().
This results in a wasted initial chunk of the preallocated memmap and an
unnecessary fallback allocation for a section.
While we are at it, also change __populate_section_memmap() to align to
the requested size because at least VMEMMAP has constraints requiring
the memmap to be properly aligned.
[1] http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
[akpm@linux-foundation.org: tweak layout, per David]
Link: http://lkml.kernel.org/r/20191119092642.31799-1-mhocko@kernel.org
Fixes: 35fd1eb1e8 ("mm/sparse: abstract sparse buffer allocations")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Debugged-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Building the kernel on s390 with -Og produces the following warning:
WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
The function populate_section_memmap() references
the function __meminit __populate_section_memmap().
This is often because populate_section_memmap lacks a __meminit
annotation or the annotation of __populate_section_memmap is wrong.
While -Og is not supported, in theory this might still happen with
another compiler or on another architecture. So fix this by using the
correct section annotations.
[iii@linux.ibm.com: v2]
Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparsemem without VMEMMAP has two allocation paths to allocate the
memory needed for its memmap (done in sparse_mem_map_populate()).
In one allocation path (sparse_buffer_alloc() succeeds), the memory is
not zeroed (since it was previously allocated with
memblock_alloc_try_nid_raw()).
In the other allocation path (sparse_buffer_alloc() fails and
sparse_mem_map_populate() falls back to memblock_alloc_try_nid()), the
memory is zeroed.
AFAICS this difference does not appear to be on purpose. If the code is
supposed to work with non-initialized memory (__init_single_page() takes
care of zeroing the struct pages which are actually used), we should
consistently not zero the memory, to avoid masking bugs.
( I noticed this because on my ARM64 platform, with 1 GiB of memory the
first [and only] section is allocated from the zeroing path while with
2 GiB of memory the first 1 GiB section is allocated from the
non-zeroing path. )
Michal:
"the main user visible problem is a memory wastage. The overal amount
of memory should be small. I wouldn't call it stable material."
Link: http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse_buffer_alloc(size) hands out memory of the given size from
sparsemap_buf after aligning sparsemap_buf to that size. However, the
size is at least PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and
usually larger than PAGE_SIZE.
Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
sparsemap_buf_end; since sparsemap_buf may have been advanced by
PTR_ALIGN() first, the aligned space before sparsemap_buf is wasted and
no one will touch it.
On our ARM32 platform (without SPARSEMEM_VMEMMAP):
sparse_buffer_init
  reserve d359c000 - d3e9c000 (9M)
sparse_buffer_alloc
  alloc   d3a00000 - d3e80000 (4.5M)
sparse_buffer_fini
  free    d3e80000 - d3e9c000 (~=100k)
The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed.
On an ARM64 platform (with SPARSEMEM_VMEMMAP):
sparse_buffer_init
  reserve ffffffc07d623000 - ffffffc07f623000 (32M)
sparse_buffer_alloc
  alloc   ffffffc07d800000 - ffffffc07f600000 (30M)
sparse_buffer_fini
  free    ffffffc07f600000 - ffffffc07f623000 (140K)
The reserved memory between ffffffc07d623000 - ffffffc07d800000
(~=1.9M) is unfreed.
Let's explicitly free the redundant aligned memory.
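A sketch of the fix: when the alignment skips over the front of
sparsemap_buf, hand the skipped chunk back to memblock instead of
leaking it (helper name and details are illustrative):

static void __meminit sparse_buffer_free(unsigned long size)
{
        WARN_ON(!sparsemap_buf || size == 0);
        memblock_free_early(__pa(sparsemap_buf), size);
}

void * __meminit sparse_buffer_alloc(unsigned long size)
{
        void *ptr = NULL;

        if (sparsemap_buf) {
                ptr = (void *)roundup((unsigned long)sparsemap_buf, size);
                if (ptr + size > sparsemap_buf_end) {
                        ptr = NULL;
                } else {
                        /* free the space skipped over by the alignment */
                        if ((unsigned long)ptr > (unsigned long)sparsemap_buf)
                                sparse_buffer_free((unsigned long)ptr -
                                                   (unsigned long)sparsemap_buf);
                        sparsemap_buf = ptr + size;
                }
        }
        return ptr;
}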
[arnd@arndb.de: mark sparse_buffer_free as __meminit]
Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.com
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity.
For example the following backtrace is emitted when attempting
arch_add_memory() with physical address ranges that intersect 'System
RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:
# cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
100000000-1ffffffff : System RAM
200000000-303ffffff : Persistent Memory (legacy)
304000000-43fffffff : System RAM
440000000-23ffffffff : Persistent Memory
2400000000-43bfffffff : Persistent Memory
2400000000-43bfffffff : namespace2.0
WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
[..]
RIP: 0010:add_pages+0x5c/0x60
[..]
Call Trace:
devm_memremap_pages+0x460/0x6e0
pmem_attach_disk+0x29e/0x680 [nd_pmem]
? nd_dax_probe+0xfc/0x120 [libnvdimm]
nvdimm_bus_probe+0x66/0x160 [libnvdimm]
It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platforms produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while.
A fix for that long-standing breakage introduces as many problems as it
solves as it would require a backward-incompatible change to the
namespace metadata interpretation. Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.
Note that EEXIST is no longer treated as success, as that is how
sparse_add_section() reports subsection collisions; it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.
[1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[osalvador@suse.de: fix deactivate_section for early sections]
Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.
The implication of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask. The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.
The rationale for using early_section() to short-circuit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data. So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.
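For reference, the pfn-to-bit mapping is straightforward; a sketch, with
one bit per PMD_SIZE-sized sub-section:

static inline int subsection_map_index(unsigned long pfn)
{
        /* offset of the pfn within its section, in units of sub-sections */
        return (pfn & ~PAGE_SECTION_MASK) / PAGES_PER_SUBSECTION;
}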
Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Jane Chu <jane.chu@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>