iv/linux

Johannes Weiner 63fd327016 mm: memcontrol: don't throttle dying tasks on memory.high

While investigating hosts with high cgroup memory pressures, Tejun
found culprit zombie tasks that were holding on to a lot of
memory, had SIGKILL pending, but were stuck in memory.high reclaim.

In the past, we used to always force-charge allocations from tasks
that were exiting in order to accelerate them dying and freeing up
their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
prohibit unconditional exceeding the limit of dying tasks"); it noted
that this can cause (userspace-inducible) containment failures, so it
added a mandatory reclaim and OOM kill cycle before forcing charges.
At the time, memory.high enforcement was handled in the userspace
return path, which isn't reached by dying tasks, and so memory.high
was still never enforced by dying tasks.

When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
overcharges") added synchronous reclaim for memory.high, it added
unconditional memory.high enforcement for dying tasks as well. The
call stack shows that this is the path the zombie is stuck in.

We need to accelerate dying tasks getting past memory.high, but we
cannot do it quite the same way as we do for memory.max: memory.max is
enforced strictly, and tasks aren't allowed to move past it without
FIRST reclaiming and OOM killing if necessary. This ensures very small
levels of excess. With memory.high, though, enforcement happens lazily
after the charge, and OOM killing is never triggered. A lot of
concurrent threads could have pushed, or could actively be pushing,
the cgroup into excess. The dying task will enter reclaim on every
allocation attempt, with little hope of restoring balance.

To fix this, skip synchronous memory.high enforcement on dying tasks
altogether again. Update memory.high path documentation while at it.

[hannes@cmpxchg.org: also handle tasks that are being killed during reclaim]
  Link: https://lkml.kernel.org/r/20240111192807.GA424308@cmpxchg.org
Link: https://lkml.kernel.org/r/20240111132902.389862-1-hannes@cmpxchg.org
Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-01-25 23:52:20 -08:00
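
The decision described in the commit message above can be modelled in a few lines of userspace C. This is only an illustrative sketch, not the kernel patch: `struct task_model`, `task_model_is_dying()`, `should_reclaim_over_high()`, `reclaim_rounds()` and the `RECLAIM_BATCH` threshold are all hypothetical names and values chosen for the example. It shows two things: a dying task never enters synchronous memory.high reclaim, and a task that is killed partway through reclaim stops reclaiming on the next check.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for per-task kernel state. */
struct task_model {
	bool sigkill_pending;          /* a fatal signal is pending */
	bool exiting;                  /* the task is already exiting */
	bool oom_victim;               /* the task was picked by the OOM killer */
	unsigned long pages_over_high; /* pages charged above memory.high */
};

/* Assumed overage threshold for synchronous reclaim; illustration only. */
#define RECLAIM_BATCH 64UL

/* A task counts as dying if any of the three conditions holds. */
static bool task_model_is_dying(const struct task_model *t)
{
	return t->sigkill_pending || t->exiting || t->oom_victim;
}

/*
 * Dying tasks skip memory.high enforcement entirely; everyone else
 * reclaims only when the overage exceeds the batch threshold.
 */
static bool should_reclaim_over_high(const struct task_model *t)
{
	if (task_model_is_dying(t))
		return false;
	return t->pages_over_high > RECLAIM_BATCH;
}

/*
 * Model of the reclaim loop. The dying check is repeated on every
 * iteration, so a task killed during reclaim bails out promptly.
 * kill_after simulates SIGKILL arriving after that many rounds
 * (0 means the task is never killed). Returns the rounds performed.
 */
static unsigned int reclaim_rounds(struct task_model *t,
				   unsigned long pages_per_round,
				   unsigned int kill_after)
{
	unsigned int rounds = 0;

	if (pages_per_round == 0)
		return 0; /* no progress possible; avoid looping forever */

	while (should_reclaim_over_high(t)) {
		unsigned long n = pages_per_round < t->pages_over_high ?
				  pages_per_round : t->pages_over_high;

		t->pages_over_high -= n;
		rounds++;
		if (rounds == kill_after)
			t->sigkill_pending = true; /* killed mid-reclaim */
	}
	return rounds;
}
```

In this model, a live task that is 1000 pages over the limit keeps reclaiming until it drops to the threshold or below, while the same task with a pending SIGKILL performs no reclaim at all and keeps its overage, which matches the intent of letting it exit and free its memory quickly.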

damon

mm/damon/vaddr: change asm-generic/mman-common.h to linux/mman.h

2023-12-29 11:58:57 -08:00

kasan

kasan: avoid resetting aux_lock

2024-01-12 15:20:45 -08:00

kfence

KFENCE: cleanup kfence_guarded_alloc() after CONFIG_SLAB removal

2023-12-05 11:17:58 +01:00

kmsan

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

backing-dev.c

writeback: remove redundant checks for root memcg

2023-08-21 13:37:48 -07:00

balloon_compaction.c

…

bootmem_info.c

bootmem: use kmemleak_free_part_phys in put_page_bootmem

2023-10-25 16:47:13 -07:00

cma_debug.c

…

cma_sysfs.c

mm: cma: make kobj_type structure constant

2023-03-28 16:20:06 -07:00

cma.c

mm: cma: remove unnecessary initialization of ret

2023-12-12 10:57:08 -08:00

cma.h

…

compaction.c

Generic:

2024-01-17 13:03:37 -08:00

debug_page_alloc.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

debug_page_ref.c

…

debug_vm_pgtable.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

debug.c

mm: update validate_mm() to use vma iterator

2023-06-09 16:25:31 -07:00

dmapool_test.c

dmapool: add alloc/free performance test

2023-04-05 19:42:38 -07:00

dmapool.c

mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs

2023-12-05 11:17:58 +01:00

early_ioremap.c

mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup()

2023-06-09 16:25:56 -07:00

fadvise.c

mm: remove unnecessary pagevec includes

2023-06-23 16:59:31 -07:00

fail_page_alloc.c

mm: page_alloc: split out FAIL_PAGE_ALLOC

2023-06-09 16:25:23 -07:00

failslab.c

…

filemap.c

vfs-6.8.netfs

2024-01-19 09:10:23 -08:00

folio-compat.c

mm: remove page_add_new_anon_rmap and lru_cache_add_inactive_or_unevictable

2023-12-29 11:58:27 -08:00

gup_test.c

Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.

2023-06-23 16:58:19 -07:00

gup_test.h

…

gup.c

mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()

2023-12-29 11:58:56 -08:00

highmem.c

x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs

2023-12-29 12:22:28 -08:00

hmm.c

mm: enable page walking API to lock vmas during the walk

2023-08-21 13:07:20 -07:00

huge_memory.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

hugetlb_cgroup.c

mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER

2023-10-18 14:34:17 -07:00

hugetlb_vmemmap.c

mm: hugetlb_vmemmap: move mmap lock to vmemmap_remap_range()

2023-12-12 10:57:08 -08:00

hugetlb_vmemmap.h

mm: hugetlb_vmemmap: fix reference to nonexistent file

2023-10-25 16:47:14 -07:00

hugetlb.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

hwpoison-inject.c

…

init-mm.c

mm: Deprecate pasid field

2023-12-12 10:11:32 +01:00

internal.h

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

interval_tree.c

…

io-mapping.c

…

ioremap.c

mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed

2023-08-18 10:12:36 -07:00

Kconfig

IOMMU Updates for Linux v6.8

2024-01-18 15:16:57 -08:00

Kconfig.debug

mm/slab: remove CONFIG_SLAB from all Kconfig and Makefile

2023-12-05 11:14:40 +01:00

khugepaged.c

header cleanups for 6.8

2024-01-10 16:43:55 -08:00

kmemleak.c

kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers

2023-12-12 10:57:07 -08:00

ksm.c

mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()

2023-12-29 11:58:56 -08:00

list_lru.c

mm/list_lru.c: remove unused list_lru_from_kmem()

2023-12-20 14:48:11 -08:00

maccess.c

mm: Fix copy_from_user_nofault().

2023-04-12 17:36:23 -07:00

madvise.c

mm: return a folio from read_swap_cache_async()

2023-12-29 11:58:32 -08:00

Makefile

mm/slab: remove CONFIG_SLAB from all Kconfig and Makefile

2023-12-05 11:14:40 +01:00

mapping_dirty_helpers.c

mm: fix clean_record_shared_mapping_range kernel-doc

2023-08-24 16:20:30 -07:00

memblock.c

memblock: code readability improvement

2024-01-18 16:46:18 -08:00

memcontrol.c

mm: memcontrol: don't throttle dying tasks on memory.high

2024-01-25 23:52:20 -08:00

memfd.c

memfd: drop warning for missing exec-related flags

2023-10-04 10:32:22 -07:00

memory_hotplug.c

mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval

2024-01-12 15:20:48 -08:00

memory-failure.c

fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling

2024-01-25 23:52:20 -08:00

memory-tiers.c

base/node / acpi: Change 'node_hmem_attrs' to 'access_coordinates'

2023-12-22 14:23:13 -08:00

memory.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

mempolicy.c

Many singleton patches against the MM code. The patch series which are

2023-11-02 19:38:47 -10:00

mempool.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

memremap.c

mm: remove stale example from comment

2023-12-29 11:58:26 -08:00

memtest.c

mm: memtest: convert to memtest_report_meminfo()

2023-08-21 13:37:47 -07:00

migrate_device.c

mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()

2023-12-29 11:58:56 -08:00

migrate.c

Generic:

2024-01-17 13:03:37 -08:00

mincore.c

mm: enable page walking API to lock vmas during the walk

2023-08-21 13:07:20 -07:00

mlock.c

mm: mlock: avoid folio_within_range() on KSM pages

2023-10-25 16:47:14 -07:00

mm_init.c

efi: disable mirror feature during crashkernel

2024-01-12 15:20:47 -08:00

mm_slot.h

…

mmap_lock.c

…

mmap.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

mmu_gather.c

mm/memory: page_remove_rmap() -> folio_remove_rmap_pte()

2023-12-29 11:58:54 -08:00

mmu_notifier.c

mmu_notifiers: rename invalidate_range notifier

2023-08-18 10:12:41 -07:00

mmzone.c

zswap: shrink zswap pool based on memory pressure

2023-12-12 10:57:02 -08:00

mprotect.c

mm: mprotect: use a folio in change_pte_range()

2023-10-25 16:47:12 -07:00

mremap.c

mm: abstract VMA merge and extend into vma_merge_extend() helper

2023-10-18 14:34:18 -07:00

msync.c

…

nommu.c

Many singleton patches against the MM code. The patch series which are

2023-11-02 19:38:47 -10:00

oom_kill.c

mm, oom:dump_tasks add rss detailed information printing

2023-12-10 16:51:53 -08:00

page_alloc.c

Networking changes for 6.8.

2024-01-11 10:07:29 -08:00

page_counter.c

…

page_ext.c

mm/page_ext: move functions around for minor cleanups to page_ext

2023-08-18 10:12:31 -07:00

page_idle.c

…

page_io.c

zswap: memcontrol: implement zswap writeback disabling

2023-12-29 20:22:11 -08:00

page_isolation.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

page_owner.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

page_poison.c

mm/page_poison: replace kmap_atomic() with kmap_local_page()

2023-12-10 16:51:50 -08:00

page_reporting.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

page_reporting.h

…

page_table_check.c

mm: convert page_table_check_pte_set() to page_table_check_ptes_set()

2023-08-24 16:20:18 -07:00

page_vma_mapped.c

mm: thp: introduce multi-size THP sysfs interface

2023-12-20 14:48:12 -08:00

page-writeback.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

pagewalk.c

mm: pagewalk: assert write mmap lock only for walking the user page tables

2023-12-10 16:51:53 -08:00

percpu-internal.h

percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing

2023-06-19 16:19:29 -07:00

percpu-km.c

…

percpu-stats.c

…

percpu-vm.c

…

percpu.c

mm: Introduce flush_cache_vmap_early()

2023-12-14 00:23:17 -08:00

pgalloc-track.h

…

pgtable-generic.c

mm/pgtable: notes on pte_offset_map[_lock]()

2023-08-18 10:12:25 -07:00

process_vm_access.c

mm: fix process_vm_rw page counts

2023-12-10 16:51:39 -08:00

ptdump.c

mm: ptdump should use ptep_get_lockless()

2023-06-19 16:19:24 -07:00

readahead.c

readahead: avoid multiple marked readahead pages

2024-01-25 23:52:20 -08:00

rmap.c

mm/rmap: rename COMPOUND_MAPPED to ENTIRELY_MAPPED

2023-12-29 11:58:56 -08:00

rodata_test.c

…

secretmem.c

mm/secretmem: use a folio in secretmem_fault()

2023-08-21 13:38:02 -07:00

shmem_quota.c

shmem: Add default quota limit mount options

2023-08-09 09:15:40 +02:00

shmem.c

header cleanups for 6.8

2024-01-10 16:43:55 -08:00

show_mem.c

mm, treewide: introduce NR_PAGE_ORDERS

2024-01-08 15:27:15 -08:00

shrinker_debug.c

mm: shrinker: convert shrinker_rwsem to mutex

2023-10-04 10:32:26 -07:00

shrinker.c

mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info()

2024-01-05 09:58:32 -08:00

shuffle.c

…

shuffle.h

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

slab_common.c

slub: use a folio in __kmalloc_large_node

2024-01-05 10:17:46 -08:00

slab.h

mm/slab: move kmalloc() functions from slab_common.c to slub.c

2023-12-06 11:57:21 +01:00

slub.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

sparse-vmemmap.c

mm/vmemmap: allow architectures to override how vmemmap optimization works

2023-08-18 10:12:53 -07:00

sparse.c

mm/sparsemem: fix race in accessing memory_section->usage

2023-12-29 11:58:43 -08:00

swap_cgroup.c

…

swap_slots.c

…

swap_state.c

mm: convert swap_cluster_readahead and swap_vma_readahead to return a folio

2023-12-29 11:58:32 -08:00

swap.c

mm: remove references to pagevec

2023-06-23 16:59:30 -07:00

swap.h

mm: convert swap_cluster_readahead and swap_vma_readahead to return a folio

2023-12-29 11:58:32 -08:00

swapfile.c

header cleanups for 6.8

2024-01-10 16:43:55 -08:00

truncate.c

fs: convert error_remove_page to error_remove_folio

2023-12-10 16:51:42 -08:00

usercopy.c

mm: Fix copy_from_user_nofault().

2023-04-12 17:36:23 -07:00

userfaultfd.c

userfaultfd: avoid huge_zero_page in UFFDIO_MOVE

2024-01-12 15:20:49 -08:00

util.c

mm/util: use kmap_local_page() in memcmp_pages()

2023-12-10 16:51:49 -08:00

vmalloc.c

mm/vmalloc: fix the unchecked dereference warning in vread_iter()

2023-11-01 12:38:35 -07:00

vmpressure.c

eventfd: simplify eventfd_signal()

2023-11-28 14:08:38 +01:00

vmscan.c

Many singleton patches against the MM code. The patch series which

2024-01-09 11:18:47 -08:00

vmstat.c

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08 15:27:15 -08:00

workingset.c

mm: ratelimit stat flush from workingset shrinker

2024-01-05 10:17:45 -08:00

z3fold.c

mm/z3fold: remove obsolete comment for struct z3fold_pool

2023-08-21 13:37:51 -07:00

zbud.c

mm: zswap: remove shrink from zpool interface

2023-06-19 16:19:27 -07:00

zpool.c

mm: zswap: remove shrink from zpool interface

2023-06-19 16:19:27 -07:00

zsmalloc.c

mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large

2024-01-05 10:17:47 -08:00

zswap.c

zswap: memcontrol: implement zswap writeback disabling

2023-12-29 20:22:11 -08:00