115d9d77bb
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes:
|
||
---|---|---|
.. | ||
asm | ||
lib | ||
linux | ||
scripts | ||
tests | ||
.gitignore | ||
internal.h | ||
main.c | ||
Makefile | ||
mmzone.c | ||
README | ||
TODO |
================== Memblock simulator ================== Introduction ============ Memblock is a boot time memory allocator[1] that manages memory regions before the actual memory management is initialized. Its APIs allow to register physical memory regions, mark them as available or reserved, allocate a block of memory within the requested range and/or in specific NUMA node, and many more. Because it is used so early in the booting process, testing and debugging it is difficult. This test suite, usually referred as memblock simulator, is an attempt at testing the memblock mechanism. It runs one monolithic test that consist of a series of checks that exercise both the basic operations and allocation functionalities of memblock. The main data structure of the boot time memory allocator is initialized at the build time, so the checks here reuse its instance throughout the duration of the test. To ensure that tests don't affect each other, region arrays are reset in between. As this project uses the actual memblock code and has to run in user space, some of the kernel definitions were stubbed by the initial commit that introduced memblock simulator (commit 16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")) and a few preparation commits just before it. Most of them don't match the kernel implementation, so one should consult them first before making any significant changes to the project. Usage ===== To run the tests, build the main target and run it: $ make && ./main A successful run produces no output. It is possible to control the behavior by passing options from command line. For example, to include verbose output, append the `-v` options when you run the tests: $ ./main -v This will print information about which functions are being tested and the number of test cases that passed. For the full list of options from command line, see `./main --help`. It is also possible to override different configuration parameters to change the test functions. For example, to simulate enabled NUMA, use: $ make NUMA=1 For the full list of build options, see `make help`. Project structure ================= The project has one target, main, which calls a group of checks for basic and allocation functions. Tests for each group are defined in dedicated files, as it can be seen here: memblock |-- asm ------------------, |-- lib |-- implement function and struct stubs |-- linux ------------------' |-- scripts | |-- Makefile.include -- handles `make` parameters |-- tests | |-- alloc_api.(c|h) -- memblock_alloc tests | |-- alloc_helpers_api.(c|h) -- memblock_alloc_from tests | |-- alloc_nid_api.(c|h) -- memblock_alloc_try_nid tests | |-- basic_api.(c|h) -- memblock_add/memblock_reserve/... tests | |-- common.(c|h) -- helper functions for resetting memblock; |-- main.c --------------. dummy physical memory definition |-- Makefile `- test runner |-- README |-- TODO |-- .gitignore Simulating physical memory ========================== Some allocation functions clear the memory in the process, so it is required for memblock to track valid memory ranges. To achieve this, the test suite registers with memblock memory stored by test_memory struct. It is a small wrapper that points to a block of memory allocated via malloc. For each group of allocation tests, dummy physical memory is allocated, added to memblock, and then released at the end of the test run. The structure of a test runner checking allocation functions is as follows: int memblock_alloc_foo_checks(void) { reset_memblock_attributes(); /* data structure reset */ dummy_physical_memory_init(); /* allocate and register memory */ (...allocation checks...) dummy_physical_memory_cleanup(); /* free the memory */ } There's no need to explicitly free the dummy memory from memblock via memblock_free() call. The entry will be erased by reset_memblock_regions(), called at the beginning of each test. Known issues ============ 1. Requesting a specific NUMA node via memblock_alloc_node() does not work as intended. Once the fix is in place, tests for this function can be added. 2. Tests for memblock_alloc_low() can't be easily implemented. The function uses ARCH_LOW_ADDRESS_LIMIT marco, which can't be changed to point at the low memory of the memory_block. References ========== 1. Boot time memory management documentation page: https://www.kernel.org/doc/html/latest/core-api/boot-time-mm.html