linux/drivers/staging
Michal Hocko 49550b6055 oom: add helpers for setting and clearing TIF_MEMDIE
This patchset addresses a race which was described in the changelog for
5695be142e ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen.  But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situation.  In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen.  This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.

The original patch did not close the race window completely because that
requires a more complex solution, as this patchset shows.

The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_.  As Tejun pointed out, the less likely the
race is, the harder it would be to debug the resulting weird bugs deep in
the PM freezer, where the debugging options are reduced considerably.  I
can only speculate what might happen when a task is still runnable
unexpectedly.

On the plus side, and as a side effect, the oom enable/disable now has
better (full barrier) semantics without polluting hot paths.

I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
	echo processors > /sys/power/pm_test
	while true
	do
		echo mem > /sys/power/state
		sleep 1s
	done
- simple module which allocates and frees 20M in 8K chunks. If it sees
  freezing(current) then it tries another round of allocation before calling
  try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
  and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
  it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
  in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
  I think this should be OK because __thaw_task shouldn't interfere with any
  locking further down in wake_up_process. Oleg?
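
The "many small tasks (20M anon mmap)" load from the list above could be approximated in user space roughly as follows.  This is only a sketch under assumptions: the mem_eater program actually used for the test is not part of this changelog, and the function name touch_anon_region is hypothetical.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the test setup): map `bytes` of
 * anonymous private memory, write one byte to every page so the kernel
 * has to back it with real memory, then unmap it again.
 * Returns the number of pages touched, or -1 on failure.
 */
long touch_anon_region(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	long touched = 0;
	size_t off;

	if (page <= 0 || bytes == 0)
		return -1;

	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Dirty each page; untouched anonymous pages consume no memory. */
	for (off = 0; off < bytes; off += (size_t)page) {
		p[off] = 1;
		touched++;
	}

	munmap(p, bytes);
	return touched;
}
```

Calling touch_anon_region(20UL << 20) in a loop from many forked children would give a load of the kind described, though the exact behaviour of the original test tasks may differ.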

As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled.  I wasn't able to catch a race where
oom_killer_disable would really have to wait, but that was expected
because the race is really unlikely.

[  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[  243.636072] (elapsed 2.837 seconds) done.
[  243.641985] Trying to disable OOM killer
[  243.643032] Waiting for concurent OOM victims
[  243.644342] OOM killer disabled
[  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[  243.652983] Suspending console(s) (use no_console_suspend to debug)
[  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[  243.992600] PM: suspend of devices complete after 336.667 msecs
[  243.993264] PM: late suspend of devices complete after 0.660 msecs
[  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[  243.994717] ACPI: Preparing to enter system sleep state S3
[  243.994795] PM: Saving platform NVS memory
[  243.994796] Disabling non-boot CPUs ...
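
The ordering visible in the log above (a victim is unmarked, then the OOM killer is disabled, and later allocations simply fail) can be illustrated with a small user-space model.  This is a sketch under assumptions, not the kernel implementation: all model_* names are hypothetical, only the oom_victims counter name is borrowed from the debugging output, and the real oom_killer_disable is described as waiting for victims rather than returning a count.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the kernel state. */
static atomic_int model_oom_victims;	/* victims that have not exited yet */
static atomic_bool model_oom_disabled;	/* set once the killer is disabled */

/*
 * Try to account a new OOM victim.  Returns false once the killer has
 * been disabled, i.e. no new victim may be selected.  The check and the
 * increment are not atomic as a whole; a real implementation would have
 * to serialise this against model_oom_killer_disable().
 */
bool model_mark_victim(void)
{
	if (atomic_load(&model_oom_disabled))
		return false;
	atomic_fetch_add(&model_oom_victims, 1);
	return true;
}

/* A victim has finished dying; returns how many are still outstanding. */
int model_unmark_victim(void)
{
	return atomic_fetch_sub(&model_oom_victims, 1) - 1;
}

/*
 * Disable the killer.  Returns the number of victims the caller would
 * still have to wait for before it is safe to continue with suspend.
 */
int model_oom_killer_disable(void)
{
	atomic_store(&model_oom_disabled, true);
	return atomic_load(&model_oom_victims);
}
```

In this model, a non-zero return from model_oom_killer_disable() corresponds to the "Waiting for concurent OOM victims" stage in the log, and a false return from model_mark_victim() corresponds to an allocation failing instead of triggering a kill.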

The first 2 patches are simple cleanups for OOM.  They should go in
regardless of the rest IMO.

Patches 3 and 4 are trivial printk -> pr_info conversions and they should
go in as well.

The main patch is the last one and I would appreciate acks from Tejun and
Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would be appreciated) but I am not so sure I
haven't screwed anything up in the freezer code.  I have found several
surprises there.

This patch (of 5):

This patch is purely preparatory and doesn't introduce any functional
change.
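
The idea of such helpers can be sketched in user space as below.  This is an illustrative model under assumptions, not the kernel code: struct model_task, MODEL_TIF_MEMDIE and the model_* functions are hypothetical stand-ins.  The point it shows is that once every caller sets and clears the flag through one pair of functions, a later patch can add accounting in a single place.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the kernel's task_struct or TIF_MEMDIE. */
#define MODEL_TIF_MEMDIE (1u << 0)

struct model_task {
	atomic_uint flags;
};

/* The only place that sets the flag. */
void model_mark_oom_victim(struct model_task *tsk)
{
	atomic_fetch_or(&tsk->flags, MODEL_TIF_MEMDIE);
}

/* The only place that clears the flag. */
void model_unmark_oom_victim(struct model_task *tsk)
{
	atomic_fetch_and(&tsk->flags, ~MODEL_TIF_MEMDIE);
}

/* Query helper so callers never test the bit directly. */
bool model_is_oom_victim(struct model_task *tsk)
{
	return (atomic_load(&tsk->flags) & MODEL_TIF_MEMDIE) != 0;
}
```

Because behaviour is unchanged compared with setting and clearing the bit directly, a conversion of this kind stays free of functional changes.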

Note:
I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent further kills.  This is
just a side effect of the flag.  The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:03 -08:00
android oom: add helpers for setting and clearing TIF_MEMDIE 2015-02-11 17:06:03 -08:00
board
clocking-wizard Drivers:staging:clocking-wizard: Added a newline 2014-12-02 16:45:24 -08:00
comedi staging: comedi: change some printk calls to pr_err 2014-12-02 16:54:43 -08:00
cptm1217 staging: cptm1217: Remove useless cast on void pointer 2014-10-30 13:05:46 -07:00
dgap Drivers:staging:dgap: Added a blank line after declaration 2014-11-26 14:00:22 -08:00
dgnc staging: dgnc: Remove useless cast on void pointer 2014-10-30 13:05:46 -07:00
emxx_udc Staging patches for 3.19-rc1 2014-12-15 18:06:13 -08:00
ft1000 staging: ft1000 : replace __attribute ((__packed__) with __packed 2014-12-02 16:48:10 -08:00
fwserial staging: fwserial: remove multiple blank lines 2014-11-26 13:53:25 -08:00
gdm72xx More ACPI and power management updates for 3.19-rc1 2014-12-18 20:28:33 -08:00
gdm724x
goldfish
gs_fpgaboot staging: gs_fpgaboot: fix a compiler warning with make W=2 2014-10-29 17:39:55 +08:00
iio Staging patches for 3.19-rc1 2014-12-15 18:06:13 -08:00
lustre vm: add VM_FAULT_SIGSEGV handling support 2015-01-29 10:51:32 -08:00
media [media] staging: lirc_serial: adjust boolean assignments 2015-02-03 18:18:16 -02:00
mt29f_spinand
netlogic
nvec staging: nvec: specify a platform-device base id 2015-01-25 19:04:31 +08:00
octeon Staging patches for 3.19-rc1 2014-12-15 18:06:13 -08:00
octeon-usb Staging patches for 3.19-rc1 2014-12-15 18:06:13 -08:00
olpc_dcon staging: olpc_dcon: Deletion of a check before backlight_device_unregister() 2014-11-26 14:00:22 -08:00
ozwpan Staging patches for 3.19-rc1 2014-12-15 18:06:13 -08:00
panel staging: panel: Move LCD-related state into struct lcd 2014-12-02 16:34:35 -08:00
rtl8188eu staging: rtl8188eu: hal: hal_intf.c: Cleaning up functions that are not used anywhere 2014-12-02 16:29:26 -08:00
rtl8192e drivers: staging: rtl8192e: Include "asm/unaligned.h" instead of "access_ok.h" in "rtl819x_BAProc.c" 2014-12-02 16:47:11 -08:00
rtl8192u Staging: rtl8192u: Use put_unaligned_le16 2014-11-03 16:09:27 -08:00
rtl8712 staging: rtl8712: remove unnecessary else after return 2014-12-02 16:54:43 -08:00
rtl8723au Here's a big pile of changes for this round. 2015-01-15 19:16:56 -05:00
rts5208 staging: rts5208: Remove useless cast on void pointer 2014-10-30 13:06:03 -07:00
skein staging: skein: fix sparse warnings related to shift operator 2014-11-22 10:53:09 -08:00
slicoss Staging: slicoss: Fix long line issues in slicoss.c 2014-12-02 16:54:43 -08:00
speakup Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2015-02-10 18:57:15 -08:00
ste_rmi4
unisys staging: unisys: remove duplicate header 2014-12-02 16:50:08 -08:00
vme
vt6655 staging: vt6655: fix sparse warning: argument type 2015-01-12 19:49:47 -08:00
vt6656 Staging patches for 3.19-rc1 2014-12-15 18:06:13 -08:00
wlan-ng Here's a big pile of changes for this round. 2015-01-15 19:16:56 -05:00
xgifb staging: xgifb: Removed a definition which was not used in driver 2014-11-03 16:10:38 -08:00
Kconfig ALSA: move line6 usb driver into sound/usb 2015-01-12 22:29:57 +01:00
Makefile ALSA: move line6 usb driver into sound/usb 2015-01-12 22:29:57 +01:00
staging.c