iv/linux

Michal Hocko 49550b6055 oom: add helpers for setting and clearing TIF_MEMDIE		2015-02-11 17:06:03 -08:00

This patchset addresses a race which was described in the changelog for
`5695be142e` ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen. But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situation. In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen. This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.

The original patch hasn't closed the race window completely because that
would require a more complex solution, as can be seen from this patchset.

The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_. As Tejun pointed out, the more unlikely the
race condition is, the harder it would be to debug weird bugs deep in
the PM freezer when the debugging options are reduced considerably. I can
only speculate what might happen when a task is still runnable
unexpectedly.

On the plus side, and as a side effect, the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.

I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
    echo processors > /sys/power/pm_test
    while true
    do
        echo mem > /sys/power/state
        sleep 1s
    done
- simple module which allocates and frees 20M in 8K chunks. If it sees
  freezing(current) then it tries another round of allocation before calling
  try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
  and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
  it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
  in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
  I think this should be OK because __thaw_task shouldn't interfere with any
  locking down wake_up_process. Oleg?

As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled. I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.

[ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[ 243.636072] (elapsed 2.837 seconds) done.
[ 243.641985] Trying to disable OOM killer
[ 243.643032] Waiting for concurent OOM victims
[ 243.644342] OOM killer disabled
[ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[ 243.652983] Suspending console(s) (use no_console_suspend to debug)
[ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[ 243.992600] PM: suspend of devices complete after 336.667 msecs
[ 243.993264] PM: late suspend of devices complete after 0.660 msecs
[ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[ 243.994717] ACPI: Preparing to enter system sleep state S3
[ 243.994795] PM: Saving platform NVS memory
[ 243.994796] Disabling non-boot CPUs ...

The first 2 patches are simple cleanups for OOM. They should go in
regardless of the rest IMO. Patches 3 and 4 are trivial printk -> pr_info
conversions and they should go in ditto. The main patch is the last one and
I would appreciate acks from Tejun and Rafael. I think the OOM part should
be OK (except for __thaw_task vs. task_lock where a look from Oleg would be
appreciated) but I am not so sure I haven't screwed anything in the freezer
code. I have found several surprises there.

This patch (of 5):

This patch is just preparatory and it doesn't introduce any functional
change.

Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent further killing. This is just a
side effect of the flag. The primary meaning is to give the oom victim
access to the memory reserves and that shouldn't be necessary here.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
..
backing-dev.c	Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block	2014-10-18 11:53:51 -07:00
balloon_compaction.c	mm/balloon_compaction: fix deflation when compaction is disabled	2014-10-29 16:33:15 -07:00
bootmem.c	mem-hotplug: reset node managed pages when hot-adding a new pgdat	2014-11-13 16:17:06 -08:00
cleancache.c	mm: fix cleancache debugfs directory path	2015-01-20 14:08:31 +01:00
cma.c	mm: cma: split cma-reserved in dmesg log	2014-12-18 19:08:10 -08:00
compaction.c	mm: reduce try_to_compact_pages parameters	2015-02-11 17:06:02 -08:00
debug-pagealloc.c	mm/debug-pagealloc: make debug-pagealloc boottime configurable	2014-12-13 12:42:48 -08:00
debug.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
dmapool.c	mm/dmapool.c: fixed a brace coding style issue	2014-10-09 22:26:00 -04:00
early_ioremap.c	mm: create generic early_ioremap() support	2014-04-07 16:36:15 -07:00
fadvise.c	mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages	2014-12-13 12:42:49 -08:00
failslab.c
filemap_xip.c	mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub	2015-02-10 14:30:30 -08:00
filemap.c	mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub	2015-02-10 14:30:30 -08:00
frontswap.c	mm/frontswap.c: fix the condition in BUG_ON	2014-12-10 17:41:08 -08:00
gup.c	mm/hugetlb: take page table lock in follow_huge_pmd()	2015-02-11 17:06:01 -08:00
highmem.c	mm/highmem: make kmap cache coloring aware	2014-08-06 18:01:22 -07:00
huge_memory.c	mm:add KPF_ZERO_PAGE flag for /proc/kpageflags	2015-02-11 17:06:00 -08:00
hugetlb_cgroup.c	mm: page_counter: pull "-1" handling out of page_counter_memparse()	2015-02-11 17:06:02 -08:00
hugetlb.c	mm/hugetlb: add migration entry check in __unmap_hugepage_range	2015-02-11 17:06:01 -08:00
hwpoison-inject.c	mm/hwpoison-inject.c: remove unnecessary null test before debugfs_remove_recursive	2014-08-06 18:01:19 -07:00
init-mm.c
internal.h	mm: reduce try_to_compact_pages parameters	2015-02-11 17:06:02 -08:00
interval_tree.c	mm: replace vma->sharead.linear with vma->shared	2015-02-10 14:30:31 -08:00
iov_iter.c	copy_from_iter_nocache()	2014-12-08 20:25:23 -05:00
Kconfig	rcu: Make SRCU optional by using CONFIG_SRCU	2015-01-06 11:04:29 -08:00
Kconfig.debug	mm/debug_pagealloc: remove obsolete Kconfig options	2015-01-08 15:10:52 -08:00
kmemcheck.c	mm/slab_common: move kmem_cache definition to internal header	2014-10-09 22:25:50 -04:00
kmemleak-test.c	mm/kmemleak-test.c: use pr_fmt for logging	2014-06-06 16:08:18 -07:00
kmemleak.c	mm: introduce kmemleak_update_trace()	2014-06-06 16:08:17 -07:00
ksm.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
list_lru.c	mm: keep page cache radix tree nodes in check	2014-04-03 16:21:01 -07:00
maccess.c
madvise.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
Makefile	mm: replace remap_file_pages() syscall with emulation	2015-02-10 14:30:30 -08:00
memblock.c	mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG	2014-12-13 12:42:46 -08:00
memcontrol.c	oom: add helpers for setting and clearing TIF_MEMDIE	2015-02-11 17:06:03 -08:00
memory_hotplug.c	mm, memory_hotplug/failure: drain single zone pcplists	2014-12-10 17:41:05 -08:00
memory-failure.c	mm: vmscan: invoke slab shrinkers from shrink_zone()	2014-12-13 12:42:48 -08:00
memory.c	Merge branch 'akpm' (patches from Andrew)	2015-02-10 16:45:56 -08:00
mempolicy.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-19 18:19:19 -08:00
mempool.c	mm/mempool.c: update the kmemleak stack trace for mempool allocations	2014-06-06 16:08:17 -07:00
migrate.c	mm/hugetlb: take page table lock in follow_huge_pmd()	2015-02-11 17:06:01 -08:00
mincore.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
mlock.c	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-10-13 15:44:12 +02:00
mm_init.c
mmap.c	rmap: drop support of non-linear mappings	2015-02-10 14:30:31 -08:00
mmu_context.c
mmu_notifier.c	mmu_notifier: add the callback for mmu_notifier_invalidate_range()	2014-11-13 13:46:09 +11:00
mmzone.c	mm: microoptimize zonelist operations	2015-02-11 17:06:02 -08:00
mprotect.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
mremap.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
msync.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
nobootmem.c	mem-hotplug: reset node managed pages when hot-adding a new pgdat	2014-11-13 16:17:06 -08:00
nommu.c	mm: replace remap_file_pages() syscall with emulation	2015-02-10 14:30:30 -08:00
oom_kill.c	oom: add helpers for setting and clearing TIF_MEMDIE	2015-02-11 17:06:03 -08:00
page_alloc.c	mm: use correct format specifiers when printing address ranges	2015-02-11 17:06:02 -08:00
page_counter.c	mm: page_counter: pull "-1" handling out of page_counter_memparse()	2015-02-11 17:06:02 -08:00
page_ext.c	mm/page_owner: keep track of page owners	2014-12-13 12:42:48 -08:00
page_io.c	fix __swap_writepage() compile failure on old gcc versions	2014-06-14 19:30:48 -05:00
page_isolation.c	mm, page_isolation: drain single zone pcplists	2014-12-10 17:41:05 -08:00
page_owner.c	mm/page_owner: correct owner information for early allocated pages	2014-12-13 12:42:48 -08:00
page-writeback.c	mm: memcontrol: track move_lock state internally	2015-02-11 17:06:00 -08:00
pagewalk.c	mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range	2015-02-05 13:35:29 -08:00
percpu-km.c	percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated	2014-09-02 14:46:05 -04:00
percpu-vm.c	percpu: move region iterations out of pcpu_[de]populate_chunk()	2014-09-02 14:46:02 -04:00
percpu.c	percpu: off by one in BUG_ON()	2014-10-29 10:34:34 -04:00
pgtable-generic.c	mm: actually clear pmd_numa before invalidating	2014-08-29 16:28:15 -07:00
process_vm_access.c	start adding the tag to iov_iter	2014-05-06 17:32:49 -04:00
quicklist.c
readahead.c	mm/readahead.c: remove unused file_ra_state from count_history_pages	2014-08-06 18:01:15 -07:00
rmap.c	mm: memcontrol: track move_lock state internally	2015-02-11 17:06:00 -08:00
shmem.c	swap: remove unused mem_cgroup_uncharge_swapcache declaration	2015-02-11 17:06:00 -08:00
slab_common.c	memcg: zap memcg_slab_caches and memcg_slab_mutex	2015-02-10 14:30:34 -08:00
slab.c	slab: fix cpuset check in fallback_alloc	2014-12-13 12:42:53 -08:00
slab.h	memcg: zap __memcg_{charge,uncharge}_slab	2015-02-10 14:30:34 -08:00
slob.c	mm/sl[ao]b: always track caller in kmalloc_(node_)track_caller()	2014-10-09 22:25:50 -04:00
slub.c	mm/slub.c: fix typo in comment	2015-02-10 14:30:30 -08:00
sparse-vmemmap.c
sparse.c	mm: use macros from compiler.h instead of __attribute__((...))	2014-04-07 16:35:54 -07:00
swap_cgroup.c	mm: page_cgroup: rename file to mm/swap_cgroup.c	2014-12-10 17:41:09 -08:00
swap_state.c	mm: page_cgroup: rename file to mm/swap_cgroup.c	2014-12-10 17:41:09 -08:00
swap.c	rmap: drop support of non-linear mappings	2015-02-10 14:30:31 -08:00
swapfile.c	mm: page_cgroup: rename file to mm/swap_cgroup.c	2014-12-10 17:41:09 -08:00
truncate.c	mm: Fix comment before truncate_setsize()	2014-11-07 08:29:25 +11:00
util.c	proc/maps: make vm_is_stack() logic namespace-friendly	2014-10-09 22:25:50 -04:00
vmacache.c	mm,vmacache: count number of system-wide flushes	2014-12-13 12:42:48 -08:00
vmalloc.c	mm/vmalloc.c: fix memory ordering bug	2014-12-13 12:42:49 -08:00
vmpressure.c	mm/vmpressure.c: fix race in vmpressure_work_fn()	2014-12-02 17:32:07 -08:00
vmscan.c	mm: memcontrol: default hierarchy interface for memory	2015-02-11 17:06:02 -08:00
vmstat.c	mm/vmstat.c: fix/cleanup ifdefs	2015-02-10 14:30:30 -08:00
workingset.c	mm: keep page cache radix tree nodes in check	2014-04-03 16:21:01 -07:00
zbud.c	mm/zbud: init user ops only when it is needed	2014-12-13 12:42:51 -08:00
zpool.c	mm/zpool: use prefixed module loading	2014-08-29 16:28:16 -07:00
zsmalloc.c	mm/zsmalloc: adjust order of functions	2014-12-18 19:08:11 -08:00
zswap.c	mm/zswap: delete unnecessary check before calling free_percpu()	2014-12-13 12:42:50 -08:00