ac35a49023
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.

The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.

The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:

1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < page = alloc_page(gfp_flags);
    ---
    > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  folio_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    ChromeOS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
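
The commit message above states three small arithmetic rules: a page accessed N times through file descriptors is in tier order_base_2(N), the number of live generations is max_seq-min_seq+1, and lrugen->lists[] is indexed by a sequence number modulo MAX_NR_GENS. The following is a minimal userspace sketch of those rules, not the kernel code. All sketch_ names are hypothetical, the loop is a stand-in for the kernel's order_base_2() macro, and the value 4 for the number of generations is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration; the message names MAX_NR_GENS but gives no value. */
#define SKETCH_MAX_NR_GENS 4UL

/*
 * Userspace stand-in for order_base_2(): the smallest order such that
 * (1 << order) >= n, with n = 0 and n = 1 both mapping to 0.
 * Inputs above the largest power of two that fits in unsigned long saturate.
 */
static unsigned int sketch_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	while (order < 8 * sizeof(n) - 1 && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int sketch_tier(unsigned long n)
{
	return sketch_order_base_2(n);
}

/* Number of generations currently spanned by [min_seq, max_seq]. */
static unsigned long sketch_nr_gens(unsigned long min_seq, unsigned long max_seq)
{
	assert(max_seq >= min_seq);
	return max_seq - min_seq + 1;
}

/* Slot in a lists[] array for a given sequence number. */
static unsigned long sketch_gen_index(unsigned long seq)
{
	return seq % SKETCH_MAX_NR_GENS;
}
```

Under these assumptions, N=0 and N=1 both land in tier 0, which matches the "first tier (N=0,1)" baseline in the message, while N=2 lands in tier 1 and N=3 or N=4 in tier 2.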
116 lines
3.5 KiB
C
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef PAGE_FLAGS_LAYOUT_H
#define PAGE_FLAGS_LAYOUT_H

#include <linux/numa.h>
#include <generated/bounds.h>

/*
 * When a memory allocation must conform to specific limitations (such
 * as being suitable for DMA) the caller will pass in hints to the
 * allocator in the gfp_mask, in the zone modifier bits. These bits
 * are used to select a priority ordered list of memory zones which
 * match the requested limits. See gfp_zone() in include/linux/gfp.h
 */
#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#elif MAX_NR_ZONES <= 8
#define ZONES_SHIFT 3
#else
#error ZONES_SHIFT "Too many zones configured"
#endif

#define ZONES_WIDTH		ZONES_SHIFT

#ifdef CONFIG_SPARSEMEM
#include <asm/sparsemem.h>
#define SECTIONS_SHIFT	(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#else
#define SECTIONS_SHIFT	0
#endif

#ifndef BUILD_VDSO32_64
/*
 * page->flags layout:
 *
 * There are five possibilities for how page->flags get laid out. The first
 * pair is for the normal case without sparsemem. The second pair is for
 * sparsemem when there is plenty of space for node and section information.
 * The last is when there is insufficient space in page->flags and a separate
 * lookup is necessary.
 *
 * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
 *      " plus space for last_cpupid: |       NODE     | ZONE | LAST_CPUPID ... | FLAGS |
 * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
 *      " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
 * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
 */
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTIONS_WIDTH		SECTIONS_SHIFT
#else
#define SECTIONS_WIDTH		0
#endif

#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
	<= BITS_PER_LONG - NR_PAGEFLAGS
#define NODES_WIDTH		NODES_SHIFT
#elif defined(CONFIG_SPARSEMEM_VMEMMAP)
#error "Vmemmap: No space for nodes field in page flags"
#else
#define NODES_WIDTH		0
#endif

/*
 * Note that this #define MUST have a value so that it can be tested with
 * the IS_ENABLED() macro.
 */
#if NODES_SHIFT != 0 && NODES_WIDTH == 0
#define NODE_NOT_IN_PAGE_FLAGS	1
#endif

#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
#define KASAN_TAG_WIDTH 8
#else
#define KASAN_TAG_WIDTH 0
#endif

#ifdef CONFIG_NUMA_BALANCING
#define LAST__PID_SHIFT 8
#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)

#define LAST__CPU_SHIFT NR_CPUS_BITS
#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)

#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
#else
#define LAST_CPUPID_SHIFT 0
#endif

#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
	KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
#else
#define LAST_CPUPID_WIDTH 0
#endif

#if LAST_CPUPID_SHIFT != 0 && LAST_CPUPID_WIDTH == 0
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif

#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
	KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
#error "Not enough bits in page flags"
#endif

/* see the comment on MAX_NR_TIERS */
#define LRU_REFS_WIDTH	min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \
			    ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \
			    NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH)

#endif
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
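
The preprocessor logic in this header is a bit budget: each field takes its width out of BITS_PER_LONG - NR_PAGEFLAGS, and LRU_REFS_WIDTH gets the smaller of what it asks for and what is left. The following userspace sketch restates that arithmetic as ordinary functions so it can be tried with concrete numbers. The struct, the sketch_ function names, and any values passed in are hypothetical and chosen for illustration only; the real header resolves everything at preprocessing time from the kernel configuration.

```c
#include <assert.h>

/* Hypothetical container for the widths the header computes with macros. */
struct sketch_flags_layout {
	int bits_per_long;
	int nr_pageflags;
	int zones_width;
	int lru_gen_width;
	int sections_width;
	int nodes_width;
	int kasan_tag_width;
	int last_cpupid_width;
	int requested_refs_width;	/* stands in for __LRU_REFS_WIDTH */
};

/* Mirrors the ZONES_SHIFT ladder; returns -1 where the header would #error. */
static int sketch_zones_shift(int max_nr_zones)
{
	if (max_nr_zones < 2)
		return 0;
	if (max_nr_zones <= 2)
		return 1;
	if (max_nr_zones <= 4)
		return 2;
	if (max_nr_zones <= 8)
		return 3;
	return -1;
}

/* Bits left in the flags word after the flag bits and all fixed fields. */
static int sketch_spare_bits(const struct sketch_flags_layout *l)
{
	int used = l->zones_width + l->lru_gen_width + l->sections_width +
		   l->nodes_width + l->kasan_tag_width + l->last_cpupid_width;

	return l->bits_per_long - l->nr_pageflags - used;
}

/* min(requested, spare), as in the LRU_REFS_WIDTH definition. */
static int sketch_lru_refs_width(const struct sketch_flags_layout *l)
{
	int spare = sketch_spare_bits(l);

	/* A negative value corresponds to "Not enough bits in page flags". */
	assert(spare >= 0);
	return l->requested_refs_width < spare ? l->requested_refs_width : spare;
}
```

With made-up inputs such as 64 bits per long, 24 flag bits, and fields summing to 11 bits, the sketch reports 29 spare bits, so a request for 2 reference bits is granted in full; if only 1 bit were spare, the same request would be clamped to 1.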