linux/drivers
Michal Hocko 49550b6055 oom: add helpers for setting and clearing TIF_MEMDIE
This patchset addresses a race which was described in the changelog for
5695be142e ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen.  But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situation.  In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen.  This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.

The original patch did not close the race window completely because that
requires a more complex solution, as this patchset shows.

The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_.  As Tejun pointed out, the fact that the race
condition is unlikely only makes it harder to debug weird bugs deep in
the PM freezer, where the debugging options are reduced considerably.  I can
only speculate what might happen when a task is still runnable
unexpectedly.

On the plus side, and as a side effect, the oom enable/disable has better
(full barrier) semantics without polluting hot paths.

I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
	echo processors > /sys/power/pm_test
	while true
	do
		echo mem > /sys/power/state
		sleep 1s
	done
- simple module which allocates and frees 20M in 8K chunks. If it sees
  freezing(current) then it tries another round of allocation before calling
  try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
  and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
  it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
  in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
  I think this should be OK because __thaw_task shouldn't interfere with any
  locking down wake_up_process. Oleg?

As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled.  I wasn't able to catch a race where
oom_killer_disable would really have to wait, but I expected as much
because the race is really unlikely.

[  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[  243.636072] (elapsed 2.837 seconds) done.
[  243.641985] Trying to disable OOM killer
[  243.643032] Waiting for concurent OOM victims
[  243.644342] OOM killer disabled
[  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[  243.652983] Suspending console(s) (use no_console_suspend to debug)
[  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[  243.992600] PM: suspend of devices complete after 336.667 msecs
[  243.993264] PM: late suspend of devices complete after 0.660 msecs
[  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[  243.994717] ACPI: Preparing to enter system sleep state S3
[  243.994795] PM: Saving platform NVS memory
[  243.994796] Disabling non-boot CPUs ...
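
The ordering visible in the log above (victim unmarked, OOM killer
disabled, later allocation failing instead of killing) can be modelled
roughly as follows.  This is a single-threaded userspace sketch, not the
kernel implementation: struct oom_state and all four function names are
hypothetical, and the locking that serializes a kill against the disable
in a real system is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model state: a disable flag plus a count of live victims. */
struct oom_state {
	atomic_bool disabled;
	atomic_int victims;
};

/*
 * Model of an OOM kill attempt.  Returns true if a victim was marked,
 * false if the killer is disabled (the allocation would fail instead).
 * NOTE: the check and the increment are not atomic as a pair here; real
 * code needs a lock around them, which this sketch does not model.
 */
static inline bool oom_kill_task(struct oom_state *s)
{
	if (atomic_load(&s->disabled))
		return false;
	atomic_fetch_add(&s->victims, 1);
	return true;
}

/* Model of a victim finishing its exit and being unmarked. */
static inline void oom_victim_exited(struct oom_state *s)
{
	atomic_fetch_sub(&s->victims, 1);
}

/* First half of the disable: no new victims may be selected from now on. */
static inline void oom_killer_disable_begin(struct oom_state *s)
{
	atomic_store(&s->disabled, true);
}

/* Second half: the disable is complete only once no victim is pending. */
static inline bool oom_killer_disable_done(struct oom_state *s)
{
	return atomic_load(&s->victims) == 0;
}
```

A caller in this model would invoke oom_killer_disable_begin() and then
wait until oom_killer_disable_done() returns true before it continues
with the suspend.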

The first 2 patches are simple cleanups for OOM.  They should go in
regardless of the rest IMO.

Patches 3 and 4 are trivial printk -> pr_info conversions and they should
go in as well.

The main patch is the last one and I would appreciate acks from Tejun and
Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would be appreciated) but I am not so sure
I haven't screwed anything up in the freezer code.  I have found several
surprises there.

This patch (of 5):

This patch is just preparatory and it doesn't introduce any functional
change.
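
To make the shape of such helpers concrete, here is a minimal userspace
sketch in C.  It only models the idea of hiding the TIF_MEMDIE bit behind
named helpers and is not the kernel code.  struct task, the
TIF_MEMDIE_BIT value and tsk_is_oom_victim() are assumptions made for
illustration; the helper names are borrowed from the cover letter above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit position; the real TIF_MEMDIE value is arch-specific. */
#define TIF_MEMDIE_BIT 0u
#define TIF_MEMDIE_MASK (1u << TIF_MEMDIE_BIT)

/* Userspace stand-in for the per-task flags word. */
struct task {
	atomic_uint flags;
};

/* Set the flag that grants the victim access to memory reserves. */
static inline void mark_tsk_oom_victim(struct task *tsk)
{
	atomic_fetch_or(&tsk->flags, TIF_MEMDIE_MASK);
}

/* Clear the flag again once the victim no longer needs the reserves. */
static inline void unmark_oom_victim(struct task *tsk)
{
	atomic_fetch_and(&tsk->flags, ~TIF_MEMDIE_MASK);
}

/* Hypothetical query helper, added only so the sketch can be checked. */
static inline bool tsk_is_oom_victim(struct task *tsk)
{
	return (atomic_load(&tsk->flags) & TIF_MEMDIE_MASK) != 0;
}
```

Routing every set and clear through one pair of functions gives a later
change a single place to attach extra bookkeeping, without touching the
callers again.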

Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent further kills. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:03 -08:00
accessibility
acpi Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
amba Char/Misc driver patches for 3.19-rc1 2014-12-14 16:43:47 -08:00
android
ata SCSI misc on 20150209 2015-02-11 10:28:45 -08:00
atm atm: remove deprecated use of pci api 2015-01-18 00:28:41 -05:00
auxdisplay
base ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
bcma bcma: implement host code support for PCIe Gen 2 devices 2015-01-29 10:54:43 +02:00
block xen: features and fixes for 3.20-rc0 2015-02-10 13:56:56 -08:00
bluetooth Bluetooth: btusb: Add support for Lite-On (04ca) Broadcom based, BCM43142 2015-02-03 08:57:14 +01:00
bus mvebu fixes for 3.19. (Part 4) 2015-01-23 14:08:13 -08:00
cdrom
char ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
clk Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
clocksource Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-01-27 16:59:56 -08:00
connector cn: verify msg->len before making callback 2014-11-26 19:09:01 -08:00
coresight coresight-replicator: remove .owner field for driver 2014-11-26 19:28:11 -08:00
cpufreq ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
cpuidle drivers: cpuidle: Don't initialize big.LITTLE driver if MCPM is unavailable 2015-01-23 15:05:48 +01:00
crypto Driver core patches for 3.19-rc1 2014-12-14 16:10:09 -08:00
dca
devfreq ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
dio
dma resources: Move struct resource_list_entry from ACPI into resource core 2015-02-05 15:09:25 +01:00
dma-buf
edac EDAC, mv64x60_edac: Fix an error code in probe() 2015-01-30 17:00:43 +01:00
eisa
extcon Char/Misc driver patches for 3.19-rc1 2014-12-14 16:43:47 -08:00
firewire firewire: sbp2: replace card lock by target lock 2014-12-10 20:53:21 +01:00
firmware * Move efivarfs from the misc filesystem section to pseudo filesystem, 2015-01-29 19:16:40 +01:00
fmc
gpio gpio: sysfs: fix memory leak in gpiod_sysfs_set_active_low 2015-01-30 10:29:33 +01:00
gpu sound updates for 3.20-rc1 2015-02-11 08:51:59 -08:00
hid Merge branches 'for-3.19/upstream-fixes', 'for-3.20/apple', 'for-3.20/betop', 'for-3.20/lenovo', 'for-3.20/logitech', 'for-3.20/rmi', 'for-3.20/upstream' and 'for-3.20/wacom' into for-linus 2015-02-09 11:17:45 +01:00
hsi hsi: nokia-modem: fix uninitialized device pointer 2015-01-04 20:19:30 +01:00
hv ACPICA: Resources: Provide common part for struct acpi_resource_address structures. 2015-01-26 16:09:56 +01:00
hwmon hwmon: (tmp102) add hibernation callbacks 2015-02-03 12:17:12 -08:00
hwspinlock
i2c i2c: sh_mobile: terminate DMA reads properly 2015-01-30 17:58:43 +01:00
ide Driver core patches for 3.19-rc1 2014-12-14 16:10:09 -08:00
idle
iio Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2015-02-11 09:32:08 -08:00
infiniband Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
input Merge branch 'next' into for-linus 2015-02-10 11:35:36 -08:00
iommu Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 16:57:56 -08:00
ipack
irqchip Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus 2015-02-06 08:28:54 -08:00
isdn drivers: isdn: isdnloop: isdnloop.c: Remove parenthesis around return values, as specified in CodingStyle. 2015-02-05 15:40:23 -08:00
leds leds: netxbig: fix oops at probe time 2015-01-13 13:49:01 -08:00
lguest virtio: allow finalize_features to fail 2014-12-09 16:32:32 +02:00
macintosh macintosh: therm_pm72: delete deprecated driver 2014-12-19 19:32:47 +01:00
mailbox ACPI / PCC: Use pr_debug() for debug messages in pcc_init() 2015-02-05 00:40:08 +01:00
mcb mcb: mcb-pci: Only remap the 1st 0x200 bytes of BAR 0 2015-01-09 15:46:37 -08:00
md Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 14:28:42 -08:00
media [media] dvb_net: Convert local hex dump to print_hex_dump_debug 2015-02-03 18:24:44 -02:00
memory MTD updates for 3.19: 2014-12-17 09:59:26 -08:00
memstick
message
mfd - Avoid platform ID collision in da9052 2015-01-21 18:29:44 +12:00
misc SCSI misc on 20150209 2015-02-11 10:28:45 -08:00
mmc mmc: sdhci-s3c: solve problem with sleeping in atomic context 2015-02-04 13:39:14 +01:00
mtd MTD updates for 3.19: 2014-12-17 09:59:26 -08:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
nfc NFC: nci: Move NFCEE discovery logic 2015-02-04 09:15:18 +01:00
ntb
nubus
of ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
oprofile
parisc parisc/PCI: Clip bridge windows to fit in upstream windows 2015-01-16 10:04:43 -06:00
parport parport: parport_atari: Remove obsolete IRQ_TYPE_SLOW 2015-01-15 13:44:50 +01:00
pci ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
pcmcia Char/Misc driver patches for 3.19-rc1 2014-12-14 16:43:47 -08:00
phy SCSI misc on 20150209 2015-02-11 10:28:45 -08:00
pinctrl pinctrl: at91: allow to have disabled gpio bank 2015-01-26 09:13:36 +01:00
platform Revert "platform: x86: dell-laptop: Add support for keyboard backlight" 2015-01-23 11:10:32 -08:00
pnp ACPI: Return translation offset when parsing ACPI address space resources 2015-02-03 22:27:21 +01:00
power power_supply: 88pm860x: Fix leaked power supply on probe fail 2015-01-28 15:08:10 +01:00
powercap powercap / RAPL: add IDs for future Xeon CPUs 2014-12-17 02:35:42 +01:00
pps
ps3
ptp
pwm pwm: Changes for v3.19-rc1 2014-12-17 10:10:51 -08:00
rapidio rapidio/tsi721: use PCI define for Max_Read_Request_Size 2015-01-27 08:14:26 -06:00
ras
regulator Merge remote-tracking branches 'regulator/topic/rk808', 'regulator/topic/rpm', 'regulator/topic/rt5033' and 'regulator/topic/tps65023' into regulator-next 2015-02-08 11:16:30 +08:00
remoteproc Driver core patches for 3.19-rc1 2014-12-14 16:10:09 -08:00
reset reset: sunxi: fix spinlock initialization 2015-01-16 19:11:31 -08:00
rpmsg
rtc Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 17:53:53 -08:00
s390 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-01-27 16:59:56 -08:00
sbus
scsi SCSI misc on 20150209 2015-02-11 10:28:45 -08:00
sfi SFI: fix compiler warnings 2014-12-03 18:49:20 -05:00
sh drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM 2014-12-05 03:08:24 +01:00
sn
soc Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2014-12-15 15:52:01 -08:00
spi Merge remote-tracking branch 'spi/topic/xilinx' into spi-next 2015-02-08 11:17:01 +08:00
spmi
ssb ssb: Fix Sparse error in main 2015-01-29 10:17:56 +02:00
staging oom: add helpers for setting and clearing TIF_MEMDIE 2015-02-11 17:06:03 -08:00
target netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
tc
thermal Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-01-27 16:59:56 -08:00
thunderbolt
tty serial: samsung: Add the support for Exynos5433 SoC 2015-01-09 13:46:02 -08:00
uio Char/Misc driver patches for 3.19-rc1 2014-12-14 16:43:47 -08:00
usb xilinx usb2 gadget: get rid of incredibly annoying compile warning 2015-02-11 10:52:56 -08:00
uwb
vfio vfio-pci: Fix the check on pci device type in vfio_pci_probe() 2015-01-07 10:29:11 -07:00
vhost Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-02-05 14:33:28 -08:00
video fbdev changes for v3.20 2015-02-11 09:24:30 -08:00
virt
virtio virtio_pci: document why we defer kfree 2015-01-06 16:35:36 +02:00
vlynq
vme
w1 Char/Misc driver patches for 3.19-rc1 2014-12-14 16:43:47 -08:00
watchdog watchdog: drop owner assignment from platform_drivers 2015-01-21 14:52:34 +01:00
xen SCSI misc on 20150209 2015-02-11 10:28:45 -08:00
zorro
Kconfig drivers/Kconfig: remove duplicate entry for soc 2015-01-25 20:26:42 +08:00
Makefile drivers: Move iommu/ before gpu/ in Makefile 2014-12-22 11:47:37 +02:00