Johannes Weiner 17edeb5d3f mm: page_alloc: remove pcppage migratetype caching

Patch series "mm: page_alloc: freelist migratetype hygiene", v4.

The page allocator's mobility grouping is intended to keep unmovable pages
separate from reclaimable/compactable ones to allow on-demand
defragmentation for higher-order allocations and huge pages.

Currently, there are several places where accidental type mixing occurs:
an allocation asks for a page of a certain migratetype and receives
another.  This ruins pageblocks for compaction, which in turn makes
allocating huge pages more expensive and less reliable.

The series addresses those causes.  The last patch adds type checks on all
freelist movements to prevent new violations being introduced.

The benefits can be seen in a mixed workload that stresses the machine
with a memcache-type workload and a kernel build job while periodically
attempting to allocate batches of THP.  The following data is aggregated
over 50 consecutive defconfig builds:

                                                        VANILLA                 PATCHED
Hugealloc Time mean                      165843.93 (    +0.00%)  113025.88 (   -31.85%)
Hugealloc Time stddev                    158957.35 (    +0.00%)  114716.07 (   -27.83%)
Kbuild Real time                            310.24 (    +0.00%)     300.73 (    -3.06%)
Kbuild User time                           1271.13 (    +0.00%)    1259.42 (    -0.92%)
Kbuild System time                          582.02 (    +0.00%)     559.79 (    -3.81%)
THP fault alloc                           30585.14 (    +0.00%)   40853.62 (   +33.57%)
THP fault fallback                        36626.46 (    +0.00%)   26357.62 (   -28.04%)
THP fault fail rate %                        54.49 (    +0.00%)      39.22 (   -27.53%)
Pagealloc fallback                         1328.00 (    +0.00%)       1.00 (   -99.85%)
Pagealloc type mismatch                  181009.50 (    +0.00%)       0.00 (  -100.00%)
Direct compact stall                        434.56 (    +0.00%)     257.66 (   -40.61%)
Direct compact fail                         421.70 (    +0.00%)     249.94 (   -40.63%)
Direct compact success                       12.86 (    +0.00%)       7.72 (   -37.09%)
Direct compact success rate %                 2.86 (    +0.00%)       2.82 (    -0.96%)
Compact daemon scanned migrate          3370059.62 (    +0.00%) 3612054.76 (    +7.18%)
Compact daemon scanned free             7718439.20 (    +0.00%) 5386385.02 (   -30.21%)
Compact direct scanned migrate           309248.62 (    +0.00%)  176721.04 (   -42.85%)
Compact direct scanned free              433582.84 (    +0.00%)  315727.66 (   -27.18%)
Compact migrate scanned daemon %             91.20 (    +0.00%)      94.48 (    +3.56%)
Compact free scanned daemon %                94.58 (    +0.00%)      94.42 (    -0.16%)
Compact total migrate scanned           3679308.24 (    +0.00%) 3788775.80 (    +2.98%)
Compact total free scanned              8152022.04 (    +0.00%) 5702112.68 (   -30.05%)
Alloc stall                                 872.04 (    +0.00%)    5156.12 (  +490.71%)
Pages kswapd scanned                     510645.86 (    +0.00%)    3394.94 (   -99.33%)
Pages kswapd reclaimed                   134811.62 (    +0.00%)    2701.26 (   -98.00%)
Pages direct scanned                      99546.06 (    +0.00%)  376407.52 (  +278.12%)
Pages direct reclaimed                    62123.40 (    +0.00%)  289535.70 (  +366.06%)
Pages total scanned                      610191.92 (    +0.00%)  379802.46 (   -37.76%)
Pages scanned kswapd %                       76.36 (    +0.00%)       0.10 (   -98.58%)
Swap out                                  12057.54 (    +0.00%)   15022.98 (   +24.59%)
Swap in                                     209.16 (    +0.00%)     256.48 (   +22.52%)
File refaults                             17701.64 (    +0.00%)   11765.40 (   -33.53%)

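Huge page success rate is higher, allocation latencies are shorter and
more predictable: THP fault allocations rise by 33.57%, mean hugealloc
time drops by 31.85%, and its standard deviation drops by 27.83%.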
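
The percentage column in the table above is not a plain relative change of
the two displayed values. As a hedged reconstruction (an assumption, not
something the series states), the deltas appear consistent with adding 1 to
both values before dividing, which keeps rows with a zero baseline well
defined. A minimal sketch, with smoothed_delta() and plain_delta() as
hypothetical helper names:

```python
def smoothed_delta(vanilla: float, patched: float) -> float:
    """Percent change with +1 added to both sides (assumed formula)."""
    return ((patched + 1.0) / (vanilla + 1.0) - 1.0) * 100.0


def plain_delta(vanilla: float, patched: float) -> float:
    """Ordinary relative change in percent, for comparison."""
    return (patched / vanilla - 1.0) * 100.0


# (vanilla, patched) pairs copied from the table above.
rows = {
    "Pagealloc fallback": (1328.00, 1.00),
    "Alloc stall": (872.04, 5156.12),
    "Pages scanned kswapd %": (76.36, 0.10),
}

for name, (vanilla, patched) in rows.items():
    print(f"{name:24s} smoothed {smoothed_delta(vanilla, patched):+8.2f}%"
          f"  plain {plain_delta(vanilla, patched):+8.2f}%")
```

For these three rows the smoothed form reproduces the printed deltas, while
the plain form does not. A few other rows differ in the last digits, which
would be expected if the reporting tool works from unrounded values.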

Stealing (fallback) rate is drastically reduced.  Notably, while the
vanilla kernel keeps doing fallbacks on an ongoing basis, the patched
kernel enters a steady state once the distribution of block types is
adequate for the workload.  Steals over 50 runs:

VANILLA         PATCHED
1504.0		227.0
1557.0		6.0
1391.0		13.0
1080.0		26.0
1057.0		40.0
1156.0		6.0
805.0		46.0
736.0		20.0
1747.0		2.0
1699.0		34.0
1269.0		13.0
1858.0		12.0
907.0		4.0
727.0		2.0
563.0		2.0
3094.0		2.0
10211.0		3.0
2621.0		1.0
5508.0		2.0
1060.0		2.0
538.0		3.0
5773.0		2.0
2199.0		0.0
3781.0		2.0
1387.0		1.0
4977.0		0.0
2865.0		1.0
1814.0		1.0
3739.0		1.0
6857.0		0.0
382.0		0.0
407.0		1.0
3784.0		0.0
297.0		0.0
298.0		0.0
6636.0		0.0
4188.0		0.0
242.0		0.0
9960.0		0.0
5816.0		0.0
354.0		0.0
287.0		0.0
261.0		0.0
140.0		1.0
2065.0		0.0
312.0		0.0
331.0		0.0
164.0		0.0
465.0		1.0
219.0		0.0
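
The steady-state claim can be checked directly against the two columns
above. The following sketch sums the steals in consecutive windows of 10
runs and computes the median per kernel; window_sums() is a hypothetical
helper, and the window width of 10 is an arbitrary choice for illustration:

```python
from statistics import median

# Steals per run, copied from the two columns above (50 runs each).
vanilla = [1504, 1557, 1391, 1080, 1057, 1156, 805, 736, 1747, 1699,
           1269, 1858, 907, 727, 563, 3094, 10211, 2621, 5508, 1060,
           538, 5773, 2199, 3781, 1387, 4977, 2865, 1814, 3739, 6857,
           382, 407, 3784, 297, 298, 6636, 4188, 242, 9960, 5816,
           354, 287, 261, 140, 2065, 312, 331, 164, 465, 219]
patched = [227, 6, 13, 26, 40, 6, 46, 20, 2, 34,
           13, 12, 4, 2, 2, 2, 3, 1, 2, 2,
           3, 2, 0, 2, 1, 0, 1, 1, 1, 0,
           0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 1, 0, 0, 0, 0, 1, 0]


def window_sums(series, width=10):
    """Sum consecutive, non-overlapping windows of `width` runs."""
    return [sum(series[i:i + width]) for i in range(0, len(series), width)]


print("vanilla per 10 runs:", window_sums(vanilla))
print("patched per 10 runs:", window_sums(patched))
print("medians:", median(vanilla), median(patched))
```

For the patched kernel the window sums are 420, 43, 11, 1 and 2, so nearly
all stealing happens in the first ten runs. The medians come out as 1328.0
and 1.0, which coincide with the "Pagealloc fallback" row of the summary
table; that suggests, without confirming, that the row reports a median.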

Type mismatches are down too.  These are counted every time an allocation
request asks for one migratetype and gets another.  They can still occur
occasionally in the patched kernel due to non-stealing fallbacks, but they
are rare and follow the pattern of overall fallbacks; once the block
type distribution settles, mismatches cease as well:

VANILLA         PATCHED
182602.0	268.0
135794.0	20.0
88619.0		19.0
95973.0		0.0
129590.0	0.0
129298.0	0.0
147134.0	0.0
230854.0	0.0
239709.0	0.0
137670.0	0.0
132430.0	0.0
65712.0		0.0
57901.0		0.0
67506.0		0.0
63565.0		4.0
34806.0		0.0
42962.0		0.0
32406.0		0.0
38668.0		0.0
61356.0		0.0
57800.0		0.0
41435.0		0.0
83456.0		0.0
65048.0		0.0
28955.0		0.0
47597.0		0.0
75117.0		0.0
55564.0		0.0
38280.0		0.0
52404.0		0.0
26264.0		0.0
37538.0		0.0
19671.0		0.0
30936.0		0.0
26933.0		0.0
16962.0		0.0
44554.0		0.0
46352.0		0.0
24995.0		0.0
35152.0		0.0
12823.0		0.0
21583.0		0.0
18129.0		0.0
31693.0		0.0
28745.0		0.0
33308.0		0.0
31114.0		0.0
35034.0		0.0
12111.0		0.0
24885.0		0.0

Compaction work is markedly reduced despite much better THP rates: direct
compaction stalls drop by 40.61% and total free-page scanning by 30.05%.

In the vanilla kernel, reclaim seems to have been driven primarily by
watermark boosting that happens as a result of fallbacks.  With those all
but eliminated, watermarks average lower and kswapd does less work.  The
uptick in direct reclaim is because THP requests have to fend for
themselves more often, which is intended policy right now.  Aggregate
reclaim activity is lowered significantly, though: total pages scanned
drop by 37.76%.


This patch (of 10):

The idea behind the pcppage migratetype cache is to save
get_pageblock_migratetype() lookups during bulk freeing.  A microbenchmark
suggests this isn't helping, though.  The cached pcp migratetype can get
stale, which means that bulk freeing has an extra branch to check if the
pageblock was isolated while the page was on the pcp.

While the variance overlaps, the cache write and the branch seem to make
this a net negative.  The following test allocates and frees batches of
10,000 pages (~3x the pcp high marks to trigger flushing):

Before:
          8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
                19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
                 0      cpu-migrations                   #    0.000 /sec
            17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
    41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
   126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
    25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
        33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )

         0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )

After:
          8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
                22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
                 0      cpu-migrations                   #    0.000 /sec
            17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
    40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
   126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
    25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
        32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )

         0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
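
To make the "variance overlaps" remark concrete, the sketch below compares
the cycle counts from the two perf runs above. It assumes the "+-"
percentage can be read as a symmetric band around the reported value;
interval() and overlaps() are hypothetical helper names:

```python
def interval(mean, rel_err_pct):
    """Return (low, high) for a value reported as mean +- rel_err_pct%."""
    half_width = mean * rel_err_pct / 100.0
    return mean - half_width, mean + half_width


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Cycle counts and their +- percentages from the perf output above.
before_cycles, before_err = 41_758_692_473, 2.90
after_cycles, after_err = 40_616_738_355, 2.90

delta_pct = (after_cycles / before_cycles - 1.0) * 100.0
bands_overlap = overlaps(interval(before_cycles, before_err),
                         interval(after_cycles, after_err))

print(f"cycles delta: {delta_pct:+.2f}%")
print("intervals overlap:", bands_overlap)
```

Under that reading the patched run uses roughly 2.7% fewer cycles while the
two bands still overlap, which matches the cautious wording of the
paragraph above.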

A side effect is that this also ensures that pages whose pageblock gets
stolen while on the pcplist end up on the right freelist and we don't
perform potentially type-incompatible buddy merges (or skip merges when we
shouldn't), which is likely beneficial to long-term fragmentation
management, although the effects would be harder to measure.  Settle for
simpler and faster code as justification here.
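
The side effect described above can be illustrated with a toy model. This
is not kernel code: the page and block identifiers, the dictionary layout
and both flush functions are hypothetical, and real pcplist flushing is
considerably more involved. The sketch only contrasts trusting a type
cached at free time with looking the type up at flush time:

```python
from collections import defaultdict

# Toy state: pageblock id -> current migratetype (hypothetical names).
pageblock_type = {0: "movable"}


def flush_with_cache(pcp_entries):
    """Old scheme (toy): trust the migratetype cached at free time."""
    freelists = defaultdict(list)
    for page, _block, cached_type in pcp_entries:
        freelists[cached_type].append(page)
    return dict(freelists)


def flush_with_lookup(pcp_entries):
    """New scheme (toy): look up the pageblock's type at flush time."""
    freelists = defaultdict(list)
    for page, block, _cached_type in pcp_entries:
        freelists[pageblock_type[block]].append(page)
    return dict(freelists)


# A page from block 0 is freed to the pcplist while the block is movable.
pcp = [("page-A", 0, pageblock_type[0])]

# The block is then stolen for unmovable allocations while the page waits.
pageblock_type[0] = "unmovable"

print(flush_with_cache(pcp))   # page lands on the stale "movable" list
print(flush_with_lookup(pcp))  # page lands on the "unmovable" list
```

In the toy, only the lookup variant puts the page on the freelist that
matches its pageblock's current type.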

Link: https://lkml.kernel.org/r/20240320180429.678181-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20240320180429.678181-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-04-25 20:56:02 -07:00

damon

mm: madvise: pageout: ignore references rather than clearing young

2024-03-04 17:01:18 -08:00

kasan

fix missing vmalloc.h includes

2024-04-25 20:55:49 -07:00

kfence

mm: introduce slabobj_ext to support slab object extensions

2024-04-25 20:55:51 -07:00

kmsan

mm: kmsan: remove runtime checks from kmsan_unpoison_memory()

2024-02-22 10:24:41 -08:00

backing-dev.c

vfs-6.9.misc

2024-03-11 09:38:17 -07:00

balloon_compaction.c

…

bootmem_info.c

bootmem: use kmemleak_free_part_phys in put_page_bootmem

2023-10-25 16:47:13 -07:00

cma_debug.c

…

cma_sysfs.c

mm/cma: add sysfs file 'release_pages_success'

2024-02-22 10:24:57 -08:00

cma.c

mm/cma: add sysfs file 'release_pages_success'

2024-02-22 10:24:57 -08:00

cma.h

mm/cma: add sysfs file 'release_pages_success'

2024-02-22 10:24:57 -08:00

compaction.c

mm: enable page allocation tagging

2024-04-25 20:55:54 -07:00

debug_page_alloc.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

debug_page_ref.c

…

debug_vm_pgtable.c

fix missing vmalloc.h includes

2024-04-25 20:55:49 -07:00

debug.c

mm: improve dumping of mapcount and page_type

2024-04-25 20:56:00 -07:00

dmapool_test.c

dmapool: add alloc/free performance test

2023-04-05 19:42:38 -07:00

dmapool.c

mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs

2023-12-05 11:17:58 +01:00

early_ioremap.c

mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup()

2023-06-09 16:25:56 -07:00

fadvise.c

mm: remove unnecessary pagevec includes

2023-06-23 16:59:31 -07:00

fail_page_alloc.c

mm: page_alloc: split out FAIL_PAGE_ALLOC

2023-06-09 16:25:23 -07:00

failslab.c

…

filemap.c

mm: enable page allocation tagging

2024-04-25 20:55:54 -07:00

folio-compat.c

mm: remove page_add_new_anon_rmap and lru_cache_add_inactive_or_unevictable

2023-12-29 11:58:27 -08:00

gup_test.c

Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.

2023-06-23 16:58:19 -07:00

gup_test.h

…

gup.c

mm/treewide: replace pXd_huge() with pXd_leaf()

2024-04-25 20:55:46 -07:00

highmem.c

x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs

2023-12-29 12:22:28 -08:00

hmm.c

mm/treewide: replace pXd_huge() with pXd_leaf()

2024-04-25 20:55:46 -07:00

huge_memory.c

mm: remove folio_prep_large_rmappable()

2024-04-25 20:56:00 -07:00

hugetlb_cgroup.c

mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER

2023-10-18 14:34:17 -07:00

hugetlb_vmemmap.c

mm: hugetlb_vmemmap: move mmap lock to vmemmap_remap_range()

2023-12-12 10:57:08 -08:00

hugetlb_vmemmap.h

mm: hugetlb_vmemmap: fix reference to nonexistent file

2023-10-25 16:47:14 -07:00

hugetlb.c

hugetlb: remove mention of destructors

2024-04-25 20:56:01 -07:00

hwpoison-inject.c

…

init-mm.c

mm: Deprecate pasid field

2023-12-12 10:11:32 +01:00

internal.h

mm: remove folio_prep_large_rmappable()

2024-04-25 20:56:00 -07:00

interval_tree.c

…

io-mapping.c

…

ioremap.c

mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed

2023-08-18 10:12:36 -07:00

Kconfig

Kbuild updates for v6.9

2024-03-21 14:41:00 -07:00

Kconfig.debug

mm/slub: unify all sl[au]b parameters with "slab_$param"

2024-01-22 10:31:08 +01:00

khugepaged.c

mm: convert free_swap_cache() to take a folio

2024-03-04 17:01:26 -08:00

kmemleak.c

mm/slub: avoid recursive loop with kmemleak

2024-04-25 20:55:59 -07:00

ksm.c

mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()

2023-12-29 11:58:56 -08:00

list_lru.c

mm/zswap: stop lru list shrinking when encounter warm region

2024-02-22 10:24:54 -08:00

maccess.c

mm: Fix copy_from_user_nofault().

2023-04-12 17:36:23 -07:00

madvise.c

mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE)

2024-04-25 20:55:43 -07:00

Makefile

kbuild: make -Woverride-init warnings more consistent

2024-03-31 11:32:26 +09:00

mapping_dirty_helpers.c

mm: fix clean_record_shared_mapping_range kernel-doc

2023-08-24 16:20:30 -07:00

memblock.c

cxl fixes for 6.8-rc6

2024-02-24 15:53:40 -08:00

memcontrol.c

mm: always initialise folio->_deferred_list

2024-04-25 20:55:59 -07:00

memfd.c

mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins()

2024-03-04 17:01:21 -08:00

memory_hotplug.c

mm/memory_hotplug: export mhp_supports_memmap_on_memory()

2024-02-22 10:24:40 -08:00

memory-failure.c

mm: free up PG_slab

2024-04-25 20:56:00 -07:00

memory-tiers.c

mm/demotion: print demotion targets

2024-02-22 10:24:55 -08:00

memory.c

mm/mempolicy: use numa_node_id() instead of cpu_to_node()

2024-04-25 20:55:48 -07:00

mempolicy.c

mm: enable page allocation tagging

2024-04-25 20:55:54 -07:00

mempool.c

mempool: hook up to memory allocation profiling

2024-04-25 20:55:56 -07:00

memremap.c

mm: remove stale example from comment

2023-12-29 11:58:26 -08:00

memtest.c

memtest: use {READ,WRITE}_ONCE in memory scanning

2024-03-13 12:12:21 -07:00

migrate_device.c

mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()

2023-12-29 11:58:56 -08:00

migrate.c

merge mm-hotfixes-stable into mm-nonmm-stable to pick up stackdepot changes

2024-02-23 17:28:43 -08:00

mincore.c

mm: enable page walking API to lock vmas during the walk

2023-08-21 13:07:20 -07:00

mlock.c

mm: make folios_put() the basis of release_pages()

2024-03-04 17:01:22 -08:00

mm_init.c

codetag: debug: mark codetags for reserved pages as empty

2024-04-25 20:55:58 -07:00

mm_slot.h

…

mmap_lock.c

…

mmap.c

RISC-V Patches for the 6.9 Merge Window

2024-03-22 10:41:13 -07:00

mmu_gather.c

mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-22 15:27:17 -08:00

mmu_notifier.c

mmu_notifiers: rename invalidate_range notifier

2023-08-18 10:12:41 -07:00

mmzone.c

zswap: shrink zswap pool based on memory pressure

2023-12-12 10:57:02 -08:00

mprotect.c

mprotect: use pfn_swap_entry_folio

2024-02-21 16:00:03 -08:00

mremap.c

mm: abstract VMA merge and extend into vma_merge_extend() helper

2023-10-18 14:34:18 -07:00

msync.c

…

nommu.c

mm: vmalloc: enable memory allocation profiling

2024-04-25 20:55:57 -07:00

oom_kill.c

mm: update mark_victim tracepoints fields

2024-03-04 17:01:16 -08:00

page_alloc.c

mm: page_alloc: remove pcppage migratetype caching

2024-04-25 20:56:02 -07:00

page_counter.c

…

page_ext.c

mm/page_ext: enable early_page_ext when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y

2024-04-25 20:55:54 -07:00

page_idle.c

…

page_io.c

zswap: memcontrol: implement zswap writeback disabling

2023-12-29 20:22:11 -08:00

page_isolation.c

mm: add alloc_contig_migrate_range allocation statistics

2024-03-04 17:01:27 -08:00

page_owner.c

mm: introduce slabobj_ext to support slab object extensions

2024-04-25 20:55:51 -07:00

page_poison.c

mm/page_poison: replace kmap_atomic() with kmap_local_page()

2023-12-10 16:51:50 -08:00

page_reporting.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

page_reporting.h

…

page_table_check.c

mm: convert page_table_check_pte_set() to page_table_check_ptes_set()

2023-08-24 16:20:18 -07:00

page_vma_mapped.c

mm: thp: introduce multi-size THP sysfs interface

2023-12-20 14:48:12 -08:00

page-writeback.c

writeback: remove a use of write_cache_pages() from do_writepages()

2024-02-23 17:48:38 -08:00

pagewalk.c

mm: pagewalk: assert write mmap lock only for walking the user page tables

2023-12-10 16:51:53 -08:00

percpu-internal.h

mm: percpu: add codetag reference into pcpuobj_ext

2024-04-25 20:55:56 -07:00

percpu-km.c

…

percpu-stats.c

…

percpu-vm.c

percpu: clean up all mappings when pcpu_map_pages() fails

2024-04-25 20:55:49 -07:00

percpu.c

mm: percpu: enable per-cpu allocation tagging

2024-04-25 20:55:56 -07:00

pgalloc-track.h

…

pgtable-generic.c

mm/pgtable: notes on pte_offset_map[_lock]()

2023-08-18 10:12:25 -07:00

process_vm_access.c

mm: fix process_vm_rw page counts

2023-12-10 16:51:39 -08:00

ptdump.c

mm: ptdump: add check_wx_pages debugfs attribute

2024-02-22 10:24:47 -08:00

readahead.c

mm: support order-1 folios in the page cache

2024-03-04 17:01:19 -08:00

rmap.c

rmap: replace two calls to compound_order with folio_order

2024-02-22 15:27:20 -08:00

rodata_test.c

…

secretmem.c

mm/secretmem: use a folio in secretmem_fault()

2023-08-21 13:38:02 -07:00

shmem_quota.c

tmpfs: fix race on handling dquot rbtree

2024-03-26 11:07:23 -07:00

shmem.c

mm/shmem: inline shmem_is_huge() for disabled transparent hugepages

2024-04-16 15:39:51 -07:00

show_mem.c

lib: add memory allocations report in show_mem()

2024-04-25 20:55:57 -07:00

shrinker_debug.c

mm: shrinker: convert shrinker_rwsem to mutex

2023-10-04 10:32:26 -07:00

shrinker.c

mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info()

2024-01-05 09:58:32 -08:00

shuffle.c

…

shuffle.h

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

slab_common.c

mm/slab: enable slab allocation tagging for kmalloc and friends

2024-04-25 20:55:55 -07:00

slab.h

mm: free up PG_slab

2024-04-25 20:56:00 -07:00

slub.c

mm/slub: avoid recursive loop with kmemleak

2024-04-25 20:55:59 -07:00

sparse-vmemmap.c

mm/vmemmap: allow architectures to override how vmemmap optimization works

2023-08-18 10:12:53 -07:00

sparse.c

mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers

2024-02-21 16:00:01 -08:00

swap_cgroup.c

…

swap_slots.c

mm/zswap: invalidate zswap entry when swap entry free

2024-02-22 10:24:54 -08:00

swap_state.c

mm: convert free_swap_cache() to take a folio

2024-03-04 17:01:26 -08:00

swap.c

mm: fix list corruption in put_pages_list

2024-03-12 13:07:16 -07:00

swap.h

mm/swap: fix race when skipping swapcache

2024-02-20 14:20:48 -08:00

swapfile.c

- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames

2024-03-14 17:43:30 -07:00

truncate.c

fs: convert error_remove_page to error_remove_folio

2023-12-10 16:51:42 -08:00

usercopy.c

mm: Fix copy_from_user_nofault().

2023-04-12 17:36:23 -07:00

userfaultfd.c

userfaultfd: fix deadlock warning when locking src and dst VMAs

2024-03-26 11:07:23 -07:00

util.c

mm: vmalloc: enable memory allocation profiling

2024-04-25 20:55:57 -07:00

vmalloc.c

mm: vmalloc: enable memory allocation profiling

2024-04-25 20:55:57 -07:00

vmpressure.c

eventfd: simplify eventfd_signal()

2023-11-28 14:08:38 +01:00

vmscan.c

- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames

2024-03-14 17:43:30 -07:00

vmstat.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

workingset.c

mm: move mapping_set_update out of <linux/swap.h>

2024-02-21 11:36:50 +05:30

z3fold.c

mm: zpool: return pool size in pages

2024-04-25 20:55:48 -07:00

zbud.c

mm: zpool: return pool size in pages

2024-04-25 20:55:48 -07:00

zpool.c

mm: zpool: return pool size in pages

2024-04-25 20:55:48 -07:00

zsmalloc.c

mm: zpool: return pool size in pages

2024-04-25 20:55:48 -07:00

zswap.c

mm: zswap: remove unnecessary check in zswap_find_zpool()

2024-04-25 20:55:48 -07:00