linux/mm
Michal Hocko 6a792697a5 memcg: do not drain charge pcp caches on remote isolated cpus
Leonardo Bras has noticed that pcp charge cache draining might be
disruptive on workloads relying on 'isolated cpus', a feature commonly
used on workloads that are sensitive to interruption and context switching
such as vRAN and Industrial Control Systems.

There are essentially two ways to approach the issue.  We can either
allow the pcp cache to be drained from a cpu other than the local one, or
avoid remote flushing on isolated cpus.

The current pcp charge cache is heavily optimized for performance and
always stays with its cpu.  That means it only requires local_lock
(preempt_disable on !RT), and draining is handed over to the pcp WQ so
that it again happens locally.

The former solution (remote draining) would require adding additional
locking to prevent local charges from racing with the draining.  This adds
an atomic operation to an otherwise simple arithmetic fast path in the
try_charge path.  Another concern is that remote draining can cause lock
contention for the isolated workloads and therefore interfere with them
indirectly via user space interfaces.
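
To illustrate the fast-path cost, here is a minimal userspace sketch, not
the kernel code.  All names (local_stock, shared_stock, consume_local,
consume_shared, drain_remote) are hypothetical, and it uses a C11 atomic
counter instead of a lock, only to show where the extra atomic operation
would land once another cpu is allowed to drain the cache.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* cpu-local cache: only the owning cpu touches it, so plain arithmetic. */
struct local_stock {
	unsigned int nr_pages;
};

static bool consume_local(struct local_stock *s, unsigned int nr)
{
	if (s->nr_pages < nr)
		return false;
	s->nr_pages -= nr;
	return true;
}

/* Remotely drainable cache: every consume must now be atomic. */
struct shared_stock {
	atomic_uint nr_pages;
};

static bool consume_shared(struct shared_stock *s, unsigned int nr)
{
	unsigned int old = atomic_load(&s->nr_pages);

	/* Retry until the subtraction wins against a concurrent drain. */
	do {
		if (old < nr)
			return false;
	} while (!atomic_compare_exchange_weak(&s->nr_pages, &old, old - nr));
	return true;
}

/* A remote cpu takes whatever is cached and leaves the cache empty. */
static unsigned int drain_remote(struct shared_stock *s)
{
	return atomic_exchange(&s->nr_pages, 0);
}
```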

Another option is to avoid scheduling the draining on isolated cpus
altogether.  That means that those remote cpus would keep their charges
even after drain_all_stock returns.  This is certainly not optimal either
but it shouldn't really cause any major problems.  In the worst case (many
isolated cpus with charges - each of them with MEMCG_CHARGE_BATCH, i.e. 64
pages) the memory consumption of a memcg would be artificially higher than
what can be immediately used from other cpus.
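
The worst case can be bounded with simple arithmetic.  The sketch below
is illustrative only: the helper names and the page size argument are
assumptions, while the batch size of 64 pages is the figure quoted above.

```c
/* Batch size as quoted in this commit message (64 pages). */
#define MEMCG_CHARGE_BATCH 64UL

/*
 * Upper bound on pages left cached on isolated cpus after a drain,
 * assuming every isolated cpu holds a full batch for the same memcg.
 */
static unsigned long stranded_pages(unsigned long nr_isolated_cpus)
{
	return nr_isolated_cpus * MEMCG_CHARGE_BATCH;
}

/* Same bound in bytes; page_size is an assumption (e.g. 4096). */
static unsigned long stranded_bytes(unsigned long nr_isolated_cpus,
				    unsigned long page_size)
{
	return stranded_pages(nr_isolated_cpus) * page_size;
}
```

For example, with 16 isolated cpus and 4096-byte pages the bound is 1024
pages, i.e. 4 MiB charged to the memcg but not usable from other cpus.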

Theoretically a memcg OOM killer could be triggered prematurely.
Currently it is not really clear whether this is a practical problem
though.  A tight memcg limit would be counterproductive for cpu isolated
workloads pretty much by definition, because any memory reclaim induced by
the memcg limit could break user space timing expectations, as those
workloads usually expect to execute in userspace most of the time.

Also charges could be left behind on memcg removal.  Any future charge on
those isolated cpus will drain that pcp cache so this won't be a permanent
leak.
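
A minimal userspace model of why the leftover is not permanent, under the
assumption that refilling the cache for a different memcg first returns
the stale charge.  The struct layouts and function bodies here are
simplified stand-ins, not the kernel implementation.

```c
#include <stddef.h>

/* usage counts pages charged to the group, including cached ones. */
struct memcg {
	long usage;
};

/* One cpu's charge cache: which memcg it belongs to and how many pages. */
struct pcp_stock {
	struct memcg *cached;
	unsigned int nr_pages;
};

/* Return cached pages to their owner and empty the cache. */
static void drain_stock(struct pcp_stock *s)
{
	if (s->cached)
		s->cached->usage -= s->nr_pages;
	s->cached = NULL;
	s->nr_pages = 0;
}

/* Caching pages for another memcg drains the stale charge first. */
static void refill_stock(struct pcp_stock *s, struct memcg *m,
			 unsigned int nr)
{
	if (s->cached != m) {
		drain_stock(s);
		s->cached = m;
	}
	s->nr_pages += nr;
}
```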

Considering the pros and cons of both approaches, this patch implements
the second option and simply does not schedule remote draining if the
target cpu is isolated.  This solution is much simpler.  It doesn't add
any new locking and it is more predictable from the user space POV.
Should the premature memcg OOM become a real life problem, we can revisit
this decision.
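
The chosen behaviour can be modelled in userspace roughly as follows.
This is a sketch under stated assumptions, not the patch itself: the
struct, the isolated flag (a stand-in for the kernel's isolation check
from sched/isolation.h) and drain_all_stock_model are hypothetical names.

```c
#include <stdbool.h>

struct cpu_stock {
	unsigned int nr_pages;	/* charges cached on this cpu */
	bool isolated;		/* stand-in for the isolation check */
	bool drain_scheduled;	/* stand-in for a queued drain work item */
};

/*
 * Drain the current cpu directly, queue a drain on other cpus that hold
 * charges, and leave isolated cpus alone.  Returns the number of remote
 * drains that were queued.
 */
static int drain_all_stock_model(struct cpu_stock *stock, int ncpus,
				 int curcpu)
{
	int scheduled = 0;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (!stock[cpu].nr_pages)
			continue;	/* nothing cached, nothing to do */
		if (cpu == curcpu) {
			stock[cpu].nr_pages = 0;	/* local drain */
			continue;
		}
		if (stock[cpu].isolated)
			continue;	/* do not disturb isolated cpus */
		stock[cpu].drain_scheduled = true;
		scheduled++;
	}
	return scheduled;
}
```

In this model an isolated cpu keeps its cached pages after the call
returns, which is exactly the trade-off described above.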

[akpm@linux-foundation.org: memcontrol.c needs sched/isolation.h]
  Link: https://lore.kernel.org/oe-kbuild-all/202303180617.7E3aIlHf-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230317134448.11082-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: Leonardo Bras <leobras@redhat.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:29:43 -07:00
damon mm/damon/sysfs: make more kobj_type structures constant 2023-04-05 19:42:59 -07:00
kasan kasan: suppress recursive reports for HW_TAGS 2023-04-05 19:42:43 -07:00
kfence mm: kfence: fix handling discontiguous page 2023-03-28 15:24:32 -07:00
kmsan sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
backing-dev.c writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs 2023-04-16 10:41:26 -07:00
balloon_compaction.c
bootmem_info.c
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
cma.c mm: move most of core MM initialization to mm/mm_init.c 2023-04-05 19:42:52 -07:00
cma.h
compaction.c mm: compaction: fix the possible deadlock when isolating hugetlb pages 2023-04-05 19:42:50 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
debug.c mm/debug: use %pGt to display page_type in dump_page() 2023-03-28 16:20:09 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-04-05 19:42:41 -07:00
early_ioremap.c
fadvise.c mm: support POSIX_FADV_NOREUSE 2023-01-18 17:12:57 -08:00
failslab.c mm: fix unexpected changes to {failslab|fail_page_alloc}.attr 2022-11-22 18:50:44 -08:00
filemap.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
folio-compat.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
frontswap.c
gup_test.c mm/gup_test: free memory allocated via kvcalloc() using kvfree() 2022-12-15 16:37:48 -08:00
gup_test.h mm/gup_test: start/stop/read functionality for PIN LONGTERM test 2022-11-08 17:37:15 -08:00
gup.c mm/gup.c: fix typo in comments 2023-03-28 16:20:14 -07:00
highmem.c highmem: fix kmap_to_page() for kmap_local_page() addresses 2022-10-12 18:51:51 -07:00
hmm.c mm/hugetlb: make walk_hugetlb_range() safe to pmd unshare 2023-01-18 17:12:39 -08:00
huge_memory.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
hugetlb_cgroup.c mm/hugetlb: increase use of folios in alloc_huge_page() 2023-02-13 15:54:27 -08:00
hugetlb_vmemmap.c mm: prefer xxx_page() alloc/free functions for order-0 pages 2023-03-28 16:20:16 -07:00
hugetlb_vmemmap.h
hugetlb.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
hwpoison-inject.c mm/hwpoison: add __init/__exit annotations to module init/exit funcs 2022-10-03 14:03:05 -07:00
init-mm.c mm: add per-VMA lock and helper functions to control it 2023-04-05 20:02:57 -07:00
internal.h mm: conditionally write-lock VMA in free_pgtables 2023-04-05 20:02:59 -07:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig mm: introduce CONFIG_PER_VMA_LOCK 2023-04-05 20:02:56 -07:00
Kconfig.debug mm: introduce per-VMA lock statistics 2023-04-05 20:03:01 -07:00
khugepaged.c mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file() 2023-04-18 16:29:43 -07:00
kmemleak.c lib/stackdepot, mm: rename stack_depot_want_early_init 2023-02-16 20:43:49 -08:00
ksm.c mm: add tracepoints to ksm 2023-03-28 16:20:08 -07:00
list_lru.c
maccess.c maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() 2022-11-11 11:44:46 -08:00
madvise.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
Makefile dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
mapping_dirty_helpers.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
memblock.c mm: avoid passing 0 to __ffs() 2023-04-18 16:29:42 -07:00
memcontrol.c memcg: do not drain charge pcp caches on remote isolated cpus 2023-04-18 16:29:43 -07:00
memfd.c mm/memfd: add write seals when apply SEAL_EXEC to executable memfd 2023-01-18 17:12:37 -08:00
memory_hotplug.c mm: avoid passing 0 to __ffs() 2023-04-18 16:29:42 -07:00
memory-failure.c mm: memory-failure: directly use IS_ENABLED(CONFIG_HWPOISON_INJECT) 2023-03-28 16:20:17 -07:00
memory-tiers.c memory tier: release the new_memtier in find_create_memory_tier() 2023-02-09 16:51:40 -08:00
memory.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
mempolicy.c mm/mempolicy: fix use-after-free of VMA iterator 2023-04-16 10:41:25 -07:00
mempool.c mempool: do not use ksize() for poisoning 2022-11-30 15:58:41 -08:00
memremap.c mm/memremap.c: fix outdated comment in devm_memremap_pages 2023-02-09 16:51:46 -08:00
memtest.c mm/memtest: add results of early memtest to /proc/meminfo 2023-04-05 19:42:55 -07:00
migrate_device.c mm: change to return bool for isolate_lru_page() 2023-02-20 12:46:17 -08:00
migrate.c mm/migrate: drop pte_mkhuge() in remove_migration_pte() 2023-03-28 16:20:11 -07:00
mincore.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
mlock.c mm: introduce vm_flags_reset_once to replace WRITE_ONCE vm_flags updates 2023-02-09 16:51:41 -08:00
mm_init.c mm: make arch_has_descending_max_zone_pfns() static 2023-04-18 16:29:42 -07:00
mm_slot.h mm: introduce common struct mm_slot 2022-10-03 14:02:43 -07:00
mmap_lock.c
mmap.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
mmu_gather.c mm: prefer xxx_page() alloc/free functions for order-0 pages 2023-03-28 16:20:16 -07:00
mmu_notifier.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
mmzone.c mm: multi-gen LRU: groundwork 2022-09-26 19:46:09 -07:00
mprotect.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
mremap.c mm/mremap: write-lock VMA while remapping it to a new address range 2023-04-05 20:02:58 -07:00
msync.c mm/msync: use vma_find() instead of vma linked list 2022-09-26 19:46:25 -07:00
nommu.c mm: vmalloc: convert vread() to vread_iter() 2023-04-05 19:42:57 -07:00
oom_kill.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
page_alloc.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
page_counter.c
page_ext.c mm/page_ext: init page_ext early if there are no deferred struct pages 2023-02-02 22:33:22 -08:00
page_idle.c mm: page_idle: convert page idle to use a folio 2023-01-18 17:12:52 -08:00
page_io.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
page_isolation.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_owner.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_poison.c
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c mm/page_ext: do not allocate space for page_ext->flags if not needed 2023-02-02 22:33:11 -08:00
page_vma_mapped.c mm/hugetlb: introduce hugetlb_walk() 2023-01-18 17:12:39 -08:00
page-writeback.c mm,jfs: move write_one_page/folio_write_one to jfs 2023-03-28 16:20:14 -07:00
pagewalk.c mm/hugetlb: introduce hugetlb_walk() 2023-01-18 17:12:39 -08:00
percpu-internal.h mm: percpu: fix incorrect size in pcpu_obj_full_size() 2023-02-16 20:43:55 -08:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: memcontrol: rename memcg_kmem_enabled() 2023-02-16 20:43:56 -08:00
pgalloc-track.h
pgtable-generic.c mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault() 2023-03-28 16:20:12 -07:00
process_vm_access.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
ptdump.c
readahead.c readahead: convert readahead_expand() to use a folio 2023-02-02 22:33:21 -08:00
rmap.c mm/khugepaged: write-lock VMA while collapsing a huge page 2023-04-05 20:02:58 -07:00
rodata_test.c mm/rodata_test: use PAGE_ALIGNED() helper 2022-10-03 14:03:05 -07:00
secretmem.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
shmem.c mm: userfaultfd: combine 'mode' and 'wp_copy' arguments 2023-04-05 19:42:48 -07:00
shrinker_debug.c mm: shrinkers: convert shrinker_rwsem to mutex 2023-03-28 16:20:17 -07:00
shuffle.c mm/shuffle: convert module_param_call to module_param_cb 2022-10-03 14:03:07 -07:00
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab_common.c mm/kasan: simplify and refine kasan_cache code 2023-01-18 17:12:55 -08:00
slab.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab.h mm: move kmem_cache_init() declaration to mm/slab.h 2023-04-05 19:42:54 -07:00
slob.c Merge branch 'slab/for-6.1/kmalloc_size_roundup' into slab/for-next 2022-09-29 11:30:55 +02:00
slub.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
sparse-vmemmap.c mm/sparse-vmemmap: generalise vmemmap_populate_hugepages() 2022-12-11 18:12:12 -08:00
sparse.c mm/sparse: fix "unused function 'pgdat_to_phys'" warning 2023-02-02 22:33:29 -08:00
swap_cgroup.c mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled 2022-10-03 14:03:36 -07:00
swap_slots.c mm/swap: convert put_swap_page() to put_swap_folio() 2022-10-03 14:02:46 -07:00
swap_state.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
swap.c mm: swap: fix performance regression on sparsetruncate-tiny 2023-04-16 10:41:24 -07:00
swap.h mm: remove the __swap_writepage return value 2023-02-02 22:33:33 -08:00
swapfile.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
truncate.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
usercopy.c mm: use kstrtobool() instead of strtobool() 2022-11-30 15:58:45 -08:00
userfaultfd.c mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs 2023-04-05 19:42:48 -07:00
util.c mm: fix typo in __vm_enough_memory warning 2023-02-13 15:54:33 -08:00
vmalloc.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
vmpressure.c
vmscan.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
vmstat.c mm: introduce per-VMA lock statistics 2023-04-05 20:03:01 -07:00
workingset.c swap_state: update shadow_nodes for anonymous page 2023-02-02 22:33:24 -08:00
z3fold.c mm: remove PageMovable export 2023-01-18 17:12:57 -08:00
zbud.c zpool: clean out dead code 2022-12-11 18:12:10 -08:00
zpool.c zpool: clean out dead code 2022-12-11 18:12:10 -08:00
zsmalloc.c zsmalloc: reset compaction source zspage pointer after putback_zspage() 2023-04-18 16:29:42 -07:00
zswap.c mm/zswap: try to avoid worst-case scenario on same element pages 2023-03-28 16:20:07 -07:00