linux

iv/linux

Johannes Weiner 51b8c1fe25 vfs: keep inodes with page cache off the inode shrinker LRU

Historically (pre-2.5), the inode shrinker used to reclaim only empty inodes and skip over those that still contained page cache. This caused problems on highmem hosts: struct inode could fill lowmem zones before the cache was getting reclaimed in the highmem zones.

To address this, the inode shrinker started to strip page cache to facilitate reclaiming lowmem. However, this comes with its own set of problems: the shrinkers may drop actively used page cache just because the inodes are not currently open or dirty - think working with a large git tree. It further doesn't respect cgroup memory protection settings and can cause priority inversions between containers.

Nowadays, the page cache also holds non-resident info for evicted cache pages in order to detect refaults. We've come to rely heavily on this data inside reclaim for protecting the cache workingset and driving swap behavior. We also use it to quantify and report workload health through psi. The latter in turn is used for fleet health monitoring, as well as driving automated memory sizing of workloads and containers, proactive reclaim and memory offloading schemes.

The consequence of dropping page cache prematurely is that we're seeing subtle and not-so-subtle failures in all of the above-mentioned scenarios, with the workload generally entering unexpected thrashing states while losing the ability to reliably detect it.

To fix this on non-highmem systems at least, going back to rotating inodes on the LRU isn't feasible. We've tried (commit `a76cf1a474` ("mm: don't reclaim inodes with many attached pages")) and failed (commit `69056ee6a8` ("Revert "mm: don't reclaim inodes with many attached pages"")).

The issue is mostly that shrinker pools attract pressure based on their size, and when objects get skipped the shrinkers remember this as deferred reclaim work. This accumulates excessive pressure on the remaining inodes, and we can quickly eat into heavily used ones, or dirty ones that require IO to reclaim, when there potentially is plenty of cold, clean cache around still.

Instead, this patch keeps populated inodes off the inode LRU in the first place - just like an open file or dirty state would. An otherwise clean and unused inode then gets queued when the last cache entry disappears. This solves the problem without reintroducing the reclaim issues, and generally is a bit more scalable than having to wade through potentially hundreds of thousands of busy inodes.

Locking is a bit tricky because the locks protecting the inode state (i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are serialized through i_lock, taken before the i_pages lock, to make sure depopulated inodes are queued reliably. Additions may race with deletions, but we'll check again in the shrinker. If additions race with the shrinker itself, we're protected by the i_lock: if find_inode() or iput() win, the shrinker will bail on the elevated i_count or I_REFERENCED; if the shrinker wins and goes ahead with the inode, it will set I_FREEING and inhibit further igets(), which will cause the other side to create a new instance of the inode instead.

Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2021-11-09 10:02:48 -08:00
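The queueing rule described in the commit message can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not the kernel code: every name here (struct toy_inode, the toy_* helpers, the TOY_* flags) is hypothetical, the model is single-threaded, and it leaves out i_lock, the i_pages lock, and I_REFERENCED entirely. It shows just three points from the text: a populated inode is not queued on the LRU, it gets queued when its last cache entry disappears, and the shrinker rechecks eligibility before freeing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags; they only loosely mirror I_DIRTY and I_FREEING. */
enum { TOY_DIRTY = 1 << 0, TOY_FREEING = 1 << 1 };

/* Hypothetical stand-in for struct inode, reduced to what the rule needs. */
struct toy_inode {
	unsigned int count;    /* active references, in the spirit of i_count */
	unsigned int state;    /* TOY_* flags */
	unsigned long nrpages; /* page cache entries attached to the inode */
	bool on_lru;           /* queued for the inode shrinker? */
};

/* Eligible for the LRU only if unused, clean, and holding no page cache. */
static bool toy_lru_eligible(const struct toy_inode *inode)
{
	return inode->count == 0 &&
	       !(inode->state & (TOY_DIRTY | TOY_FREEING)) &&
	       inode->nrpages == 0;
}

/* Dropping the last reference queues the inode only if it is eligible. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->count > 0);
	if (--inode->count == 0 && toy_lru_eligible(inode))
		inode->on_lru = true;
}

/* Additions do not touch the LRU; the shrinker rechecks later. */
static void toy_page_add(struct toy_inode *inode)
{
	inode->nrpages++;
}

/* Removing the last cache entry queues an otherwise clean, unused inode. */
static void toy_page_delete(struct toy_inode *inode)
{
	assert(inode->nrpages > 0);
	if (--inode->nrpages == 0 && toy_lru_eligible(inode))
		inode->on_lru = true;
}

/*
 * Shrinker visit: returns true if the inode was reclaimed. An inode that
 * became busy or repopulated after queueing is taken off the list instead.
 */
static bool toy_shrink_one(struct toy_inode *inode)
{
	if (!inode->on_lru)
		return false;
	inode->on_lru = false;
	if (!toy_lru_eligible(inode))
		return false;
	inode->state |= TOY_FREEING; /* would inhibit further lookups */
	return true;
}
```

In this model, an inode that is released while it still holds cached pages is invisible to toy_shrink_one() until toy_page_delete() removes its last page, which corresponds to the behavior the commit message describes for "an otherwise clean and unused inode".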
damon	mm/damon: remove return value from before_terminate callback	2021-11-06 13:30:46 -07:00
kasan	kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC	2021-11-06 13:30:37 -07:00
kfence	kfence: always use static branches to guard kfence_alloc()	2021-11-06 13:30:43 -07:00
backing-dev.c	mm/vmscan: throttle reclaim until some writeback completes if congested	2021-11-06 13:30:40 -07:00
balloon_compaction.c	mm: fix typos in comments	2021-05-07 00:26:35 -07:00
bootmem_info.c	mm/bootmem_info.c: mark __init on register_page_bootmem_info_section	2021-09-03 09:58:14 -07:00
cleancache.c
cma_debug.c	mm/cma: change cma mutex to irq safe spinlock	2021-05-05 11:27:21 -07:00
cma_sysfs.c	mm: cma: support sysfs	2021-05-05 11:27:24 -07:00
cma.c	memblock: rename memblock_free to memblock_phys_free	2021-11-06 13:30:41 -07:00
cma.h	mm: cma: support sysfs	2021-05-05 11:27:24 -07:00
compaction.c	mm/vmscan: centralise timeout values for reclaim_throttle	2021-11-06 13:30:40 -07:00
debug_page_ref.c
debug_vm_pgtable.c	mm: debug_vm_pgtable: don't use __P000 directly	2021-11-06 13:30:33 -07:00
debug.c	mm/migrate: de-duplicate migrate_reason strings	2021-11-06 13:30:41 -07:00
dmapool.c	mm/dmapool: use DEVICE_ATTR_RO macro	2021-06-29 10:53:52 -07:00
early_ioremap.c	mm/early_ioremap.c: remove redundant early_ioremap_shutdown()	2021-09-08 11:50:24 -07:00
fadvise.c
failslab.c
filemap.c	vfs: keep inodes with page cache off the inode shrinker LRU	2021-11-09 10:02:48 -08:00
frontswap.c	mm/mempool: minor coding style tweaks	2021-05-05 11:27:27 -07:00
gup_test.c	selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages	2021-05-05 11:27:26 -07:00
gup_test.h	selftests/vm: gup_test: fix test flag	2021-05-05 11:27:26 -07:00
gup.c	mm/gup: further simplify __gup_device_huge()	2021-11-06 13:30:34 -07:00
highmem.c	mm/highmem: remove deprecated kmap_atomic	2021-11-06 13:30:43 -07:00
hmm.c	mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled	2021-09-08 18:45:52 -07:00
huge_memory.c	mm: filemap: check if THP has hwpoisoned subpage for PMD page fault	2021-10-28 17:18:55 -07:00
hugetlb_cgroup.c	hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro	2021-11-06 13:30:39 -07:00
hugetlb_vmemmap.c	mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON	2021-06-30 20:47:26 -07:00
hugetlb_vmemmap.h	mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate	2021-06-30 20:47:25 -07:00
hugetlb.c	hugetlbfs: extend the definition of hugepages parameter to support node allocation	2021-11-06 13:30:41 -07:00
hwpoison-inject.c	mm: hwpoison: don't drop slab caches for offlining non-LRU page	2021-09-03 09:58:15 -07:00
init-mm.c	mm: add setup_initial_init_mm() helper	2021-07-08 11:48:21 -07:00
internal.h	mm/vmscan: centralise timeout values for reclaim_throttle	2021-11-06 13:30:40 -07:00
interval_tree.c	mm/interval_tree: add comments to improve code readability	2021-04-30 11:20:38 -07:00
io-mapping.c	mm: add a io_mapping_map_user helper	2021-04-30 11:20:39 -07:00
ioremap.c	mm: move ioremap_page_range to vmalloc.c	2021-09-08 11:50:24 -07:00
Kconfig	mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit	2021-11-06 13:30:42 -07:00
Kconfig.debug
khugepaged.c	mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged	2021-11-06 13:30:39 -07:00
kmemleak.c	mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfp	2021-09-08 18:45:53 -07:00
ksm.c	mm/ksm: remove old GCC 4.9+ check	2021-09-13 10:18:28 -07:00
list_lru.c	mm: list_lru: only add memcg-aware lrus to the global lru list	2021-11-06 13:30:35 -07:00
maccess.c	ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault	2021-08-20 11:39:25 +01:00
madvise.c	Merge branch 'akpm' (patches from Andrew)	2021-09-03 10:08:28 -07:00
Makefile	mm: introduce Data Access MONitor (DAMON)	2021-09-08 11:50:24 -07:00
mapping_dirty_helpers.c	mm/mapping_dirty_helpers: remove double Note in kerneldoc	2021-07-01 11:06:02 -07:00
memblock.c	memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED	2021-11-06 13:30:42 -07:00
memcontrol.c	mm/vmscan: throttle reclaim when no progress is being made	2021-11-06 13:30:40 -07:00
memfd.c	Reimplement RLIMIT_MEMLOCK on top of ucounts	2021-04-30 14:14:02 -05:00
memory_hotplug.c	mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED	2021-11-06 13:30:42 -07:00
memory-failure.c	mm: hwpoison: handle non-anonymous THP correctly	2021-11-06 13:30:38 -07:00
memory.c	mm: remove redundant smp_wmb()	2021-11-06 13:30:36 -07:00
mempolicy.c	mm: migrate: make demotion knob depend on migration	2021-11-06 13:30:41 -07:00
mempool.c	kasan: use separate (un)poison implementation for integrated init	2021-06-04 19:32:21 +01:00
memremap.c	mm/memory_hotplug: remove nid parameter from arch_remove_memory()	2021-09-08 11:50:23 -07:00
memtest.c
migrate.c	mm: migrate: make demotion knob depend on migration	2021-11-06 13:30:41 -07:00
mincore.c
mlock.c	mm: introduce memfd_secret system call to create "secret" memory areas	2021-07-08 11:48:21 -07:00
mm_init.c	include/linux/page-flags-layout.h: cleanups	2021-04-30 11:20:42 -07:00
mmap_lock.c	mm: mmap_lock: fix disabling preemption directly	2021-07-23 17:43:28 -07:00
mmap.c	mm/mmap.c: fix a data race of mm->total_vm	2021-11-06 13:30:35 -07:00
mmu_gather.c
mmu_notifier.c
mmzone.c
mprotect.c	mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey()	2021-11-06 13:30:36 -07:00
mremap.c	mm, hugepages: add mremap() support for hugepage backed vma	2021-11-06 13:30:39 -07:00
msync.c	mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start	2021-04-30 11:20:37 -07:00
nommu.c	mm: nommu: kill arch_get_unmapped_area()	2021-11-06 13:30:41 -07:00
oom_kill.c	mm: mark the OOM reaper thread as freezable	2021-11-06 13:30:41 -07:00
page_alloc.c	mm/page_alloc: remove the throttling logic from the page allocator	2021-11-06 13:30:40 -07:00
page_counter.c	mm: page_counter: mitigate consequences of a page_counter underflow	2021-04-30 11:20:38 -07:00
page_ext.c	mm/page_ext.c: fix a comment	2021-11-06 13:30:34 -07:00
page_idle.c	mm/idle_page_tracking: make PG_idle reusable	2021-09-08 11:50:24 -07:00
page_io.c
page_isolation.c	mm/page_isolation: guard against possible putback unisolated page	2021-11-06 13:30:40 -07:00
page_owner.c	mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE	2021-09-08 11:50:22 -07:00
page_poison.c	mm: page_poison: print page info when corruption is caught	2021-04-30 11:20:36 -07:00
page_reporting.c	mm/page_reporting: allow driver to specify reporting order	2021-06-29 10:53:47 -07:00
page_reporting.h	mm/page_reporting: export reporting order as module parameter	2021-06-29 10:53:47 -07:00
page_vma_mapped.c	mm: device exclusive memory access	2021-07-01 11:06:03 -07:00
page-writeback.c	mm/vmscan: centralise timeout values for reclaim_throttle	2021-11-06 13:30:40 -07:00
pagewalk.c	mm: pagewalk: fix walk for hugepage tables	2021-06-29 10:53:49 -07:00
percpu-internal.h	Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu	2021-07-01 17:17:24 -07:00
percpu-km.c	percpu: flush tlb in pcpu_reclaim_populated()	2021-07-04 18:30:17 +00:00
percpu-stats.c	percpu: rework memcg accounting	2021-06-05 20:43:15 +00:00
percpu-vm.c	percpu: flush tlb in pcpu_reclaim_populated()	2021-07-04 18:30:17 +00:00
percpu.c	memblock: use memblock_free for freeing virtual pointers	2021-11-06 13:30:41 -07:00
pgalloc-track.h	mm: fix typos in comments	2021-05-07 00:26:35 -07:00
pgtable-generic.c	mm/thp: fix __split_huge_pmd_locked() on shmem migration entry	2021-06-16 09:24:42 -07:00
process_vm_access.c	mm/process_vm_access.c: remove duplicate include	2021-05-05 11:27:27 -07:00
ptdump.c
readahead.c	mm/readahead.c: fix incorrect comments for get_init_ra_size	2021-11-06 13:30:41 -07:00
rmap.c	mm/rmap.c: avoid double faults migrating device private pages	2021-11-06 13:30:43 -07:00
rodata_test.c
secretmem.c	mm/secretmem: avoid letting secretmem_users drop to zero	2021-10-28 17:18:55 -07:00
shmem.c	mm: shmem: don't truncate page if memory failure happens	2021-11-06 13:30:38 -07:00
shuffle.c
shuffle.h	mm/shuffle: fix section mismatch warning	2021-05-22 15:09:07 -10:00
slab_common.c	mm: remove HARDENED_USERCOPY_FALLBACK	2021-11-06 13:30:43 -07:00
slab.c	mm: remove HARDENED_USERCOPY_FALLBACK	2021-11-06 13:30:43 -07:00
slab.h	mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()	2021-07-30 10:14:39 -07:00
slob.c
slub.c	mm: remove HARDENED_USERCOPY_FALLBACK	2021-11-06 13:30:43 -07:00
sparse-vmemmap.c	mm: remove redundant smp_wmb()	2021-11-06 13:30:36 -07:00
sparse.c	memblock: use memblock_free for freeing virtual pointers	2021-11-06 13:30:41 -07:00
swap_cgroup.c
swap_slots.c	mm: Replace deprecated CPU-hotplug functions.	2021-08-28 01:46:17 +02:00
swap_state.c	Revert "mm: swap: check if swap backing device is congested or not"	2021-08-20 11:31:42 -07:00
swap.c	mm: optimise put_pages_list()	2021-11-06 13:30:35 -07:00
swapfile.c	mm/swapfile: fix an integer overflow in swap_show()	2021-11-06 13:30:35 -07:00
truncate.c	vfs: keep inodes with page cache off the inode shrinker LRU	2021-11-09 10:02:48 -08:00
usercopy.c
userfaultfd.c	mm: shmem: don't truncate page if memory failure happens	2021-11-06 13:30:38 -07:00
util.c	mm: fix uninitialized use in overcommit_policy_handler	2021-09-24 16:13:35 -07:00
vmacache.c
vmalloc.c	mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation	2021-11-06 13:30:37 -07:00
vmpressure.c	mm/vmpressure: fix data-race with memcg->socket_pressure	2021-11-06 13:30:40 -07:00
vmscan.c	vfs: keep inodes with page cache off the inode shrinker LRU	2021-11-09 10:02:48 -08:00
vmstat.c	mm: vmstat.c: make extfrag_index show more pretty	2021-11-06 13:30:42 -07:00
workingset.c	vfs: keep inodes with page cache off the inode shrinker LRU	2021-11-09 10:02:48 -08:00
z3fold.c	mm/z3fold: add kerneldoc fields for z3fold_pool	2021-07-01 11:06:03 -07:00
zbud.c	mm/zbud: add kerneldoc fields for zbud_pool	2021-07-01 11:06:03 -07:00
zpool.c	mm: fix typos in comments	2021-05-07 00:26:35 -07:00
zsmalloc.c	mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()	2021-11-06 13:30:43 -07:00
zswap.c	mm/zswap.c: fix two bugs in zswap_writeback_entry()	2021-06-30 20:47:31 -07:00