linux/drivers
Dan Schatzberg 87579e9b7d loop: use worker per cgroup instead of kworker
Patch series "Charge loop device i/o to issuing cgroup", v14.

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup.  This
allows a loop device to be used to trivially bypass resource limits and
other policy.  This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on a cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
    sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -d $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through the
single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device.  This patch series makes minor
modifications to the loop driver so that each cgroup can make forward
progress independently, avoiding this inversion.

With this patch series applied, the above script triggers OOM kills when
writing through the loop device as expected.

This patch (of 3):

Existing uses of the loop device may have multiple cgroups reading/writing
to the same device.  Simply charging resources for I/O to the backing file
could result in priority inversion where one cgroup gets synchronously
blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue where
each work item is a "struct loop_worker" which contains a queue of struct
loop_cmds to issue.  The loop device maintains a tree mapping blk css_id
-> loop_worker.  This allows each cgroup to independently make forward
progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which can
be used in cases of extreme memory shortage where we cannot allocate a
loop_worker.
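
The dispatch scheme described above can be modeled in userspace as a rough
sketch.  This is not the driver code: the struct layouts, the linked list
standing in for the tree, the can_alloc flag that simulates allocation
failure, and the helper names queue_push(), find_worker() and
loop_queue_cmd() are all assumptions made for illustration.  The workqueue
itself and the per-device spinlock are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One queued I/O request (hypothetical layout). */
struct loop_cmd {
    int id;
    struct loop_cmd *next;
};

/* Simple FIFO of commands. */
struct cmd_queue {
    struct loop_cmd *head;
    struct loop_cmd *tail;
    size_t len;
};

/* Per-cgroup work item holding that cgroup's pending commands. */
struct loop_worker {
    unsigned long css_id;
    struct cmd_queue cmds;
    struct loop_worker *next;   /* linked list stands in for the tree */
};

struct loop_device {
    struct loop_worker *workers;    /* css_id -> loop_worker lookup */
    struct cmd_queue rootcg_cmds;   /* fallback queue for the root cgroup */
};

static void queue_push(struct cmd_queue *q, struct loop_cmd *cmd)
{
    assert(q && cmd);
    cmd->next = NULL;
    if (q->tail)
        q->tail->next = cmd;
    else
        q->head = cmd;
    q->tail = cmd;
    q->len++;
}

static struct loop_worker *find_worker(struct loop_device *lo,
                                       unsigned long css_id)
{
    for (struct loop_worker *w = lo->workers; w; w = w->next)
        if (w->css_id == css_id)
            return w;
    return NULL;
}

/*
 * Queue cmd on behalf of cgroup css_id and return the queue it landed on.
 * Passing can_alloc = false simulates a failed worker allocation, in which
 * case the command falls back to the root cgroup queue.
 */
static struct cmd_queue *loop_queue_cmd(struct loop_device *lo,
                                        struct loop_cmd *cmd,
                                        unsigned long css_id, bool can_alloc)
{
    struct loop_worker *w = find_worker(lo, css_id);

    if (!w && can_alloc) {
        w = calloc(1, sizeof(*w));
        if (w) {
            w->css_id = css_id;
            w->next = lo->workers;
            lo->workers = w;
        }
    }

    struct cmd_queue *q = w ? &w->cmds : &lo->rootcg_cmds;
    queue_push(q, cmd);
    return q;
}
```

In this model, two commands queued with the same css_id share one worker
queue, commands from different css_ids land on separate queues, and a
command whose worker cannot be allocated ends up on rootcg_cmds.  Because
each cgroup has its own queue, a blocked cgroup only stalls its own
commands.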

The locking for the tree and queues is fairly heavy-handed: we acquire a
per-loop-device spinlock any time either is accessed.  The existing
implementation serializes all I/O through a single thread anyway, so I
don't believe this is any worse.

[colin.king@canonical.com: fixes]

Link: https://lkml.kernel.org/r/20210610173944.1203706-1-schatzberg.dan@gmail.com
Link: https://lkml.kernel.org/r/20210610173944.1203706-2-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:50 -07:00
accessibility TTY/Serial driver updates for 5.13-rc1 2021-04-26 11:20:10 -07:00
acpi Merge branch 'acpi-bus' 2021-06-11 17:57:24 +02:00
amba
android binder: Return EFAULT if we fail BINDER_ENABLE_ONEWAY_SPAM_DETECTION 2021-05-13 20:35:26 +02:00
ata pci-v5.13-changes 2021-05-05 13:24:11 -07:00
atm atm: firestream: Use fallthrough pseudo-keyword 2021-05-07 16:01:08 -07:00
auxdisplay treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
base software node: Handle software node injection to an existing device properly 2021-06-23 19:34:58 +02:00
bcma bcma: remove unused function 2021-04-18 09:36:56 +03:00
block loop: use worker per cgroup instead of kworker 2021-06-29 10:53:50 -07:00
bluetooth Networking fixes for 5.13-rc5, including fixes from bpf, wireless, 2021-06-04 18:25:39 -07:00
bus Char/Misc driver fixes for 5.13-rc6 2021-06-12 12:13:55 -07:00
cdrom cdrom: gdrom: initialize global variable at init time 2021-05-13 18:58:44 +02:00
char Char/misc driver fixes for 5.13-rc3 2021-05-20 06:31:52 -10:00
clk clk: Skip clk provider registration when np is NULL 2021-05-11 08:47:25 +02:00
clocksource clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86 2021-05-14 14:55:13 +02:00
comedi staging: comedi: move out of staging directory 2021-04-15 09:26:25 +02:00
connector
counter
cpufreq Revert "cpufreq: CPPC: Add support for frequency invariance" 2021-06-14 15:55:02 +02:00
cpuidle
crypto Revert "crypto: cavium/nitrox - add an error message to explain the failure of pci_request_mem_regions" 2021-05-13 17:23:05 +02:00
cxl cxl/mem: Fix memory device capacity probing 2021-04-16 18:21:56 -07:00
dax fs: remove noop_set_page_dirty() 2021-06-29 10:53:48 -07:00
dca
devfreq
dio
dma dmaengine fixes for v5.13 2021-06-16 09:03:52 -07:00
dma-buf dma-buf: fix unintended pin/unpin warnings 2021-05-20 14:02:27 +02:00
edac x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG 2021-05-10 07:51:38 +02:00
eisa
extcon - Core Frameworks 2021-04-28 15:59:13 -07:00
firewire The usual updates from the irq departement: 2021-04-26 09:43:16 -07:00
firmware EFI fixes for v5.13-rc 2021-05-23 11:39:02 +02:00
fpga ARM: SoC drivers for v5.13 2021-04-26 12:11:52 -07:00
fsi
gnss
gpio gpio: AMD8111 and TQMX86 require HAS_IOPORT_MAP 2021-06-25 12:13:53 +02:00
gpu A DMA address check for nouveau, an error code return fix for kmb, fixes 2021-06-25 06:05:13 +10:00
greybus greybus: es2: fix kernel-doc warnings 2021-04-16 07:26:50 +02:00
hid HID: asus: Cleanup Asus T101HA keyboard-dock handling 2021-05-27 15:40:35 +02:00
hsi HSI: core: fix resource leaks in hsi_add_client_from_dt() 2021-04-16 00:14:49 +02:00
hv printk changes for 5.13 2021-04-27 18:09:44 -07:00
hwmon hwmon: (tps23861) correct shunt LSB values 2021-06-10 08:40:09 -07:00
hwspinlock
hwtracing ARM: 2021-05-01 10:14:08 -07:00
i2c i2c: robotfuzz-osif: fix control-request directions 2021-06-24 22:08:00 +02:00
i3c Revert "i3c master: fix missing destroy_workqueue() on error in i3c_master_register" 2021-04-24 22:21:01 +02:00
ide
idle
iio iio: adc: ad7793: Add missing error code in ad7793_setup() 2021-05-22 08:32:36 +01:00
infiniband IB/mlx5: Fix initializing CQ fragments buffer 2021-06-10 08:59:34 -03:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2021-05-06 23:37:55 -07:00
interconnect interconnect: qcom: Add missing MODULE_DEVICE_TABLE 2021-05-11 07:26:31 +03:00
iommu iommu/vt-d: Fix sysfs leak in alloc_iommu() 2021-05-27 16:07:08 +02:00
ipack
irqchip irqchip fixes for 5.13, take #2 2021-06-17 15:22:31 +02:00
isdn Networking fixes for 5.13-rc4, including fixes from bpf, netfilter, 2021-05-26 17:44:49 -10:00
leds leds: lp5523: check return value of lp5xx_read and jump to cleanup code 2021-05-13 17:30:15 +02:00
lightnvm lightnvm: deprecated OCSSD support and schedule it for removal in Linux 5.15 2021-04-13 09:16:12 -06:00
macintosh macintosh/via-pmu: Fix build warning 2021-04-16 23:57:51 +10:00
mailbox - qcom: enable support for SM8350 and SC7280 2021-04-28 16:10:33 -07:00
mcb
md block-5.13-2021-06-12 2021-06-12 11:59:58 -07:00
media media: gspca: properly check for errors in po1030_probe() 2021-05-13 18:58:32 +02:00
memory .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
memstick memstick: r592: ignore kfifo_out() return code again 2021-04-26 11:08:23 +02:00
message scsi: message: fusion: Remove unused local variable 'vtarget' 2021-04-13 01:39:12 -04:00
mfd - Core Frameworks 2021-04-28 15:59:13 -07:00
misc misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG 2021-06-09 19:10:22 +02:00
mmc mmc: meson-gx: use memcpy_to/fromio for dram-access-quirk 2021-06-14 14:02:33 +02:00
most Staging/IIO driver updates for 5.13-rc1 2021-04-26 11:14:21 -07:00
mtd mtd: parsers: ofpart: fix parsing subpartitions 2021-05-10 18:34:30 +02:00
mux
net Networking fixes for 5.13-rc7, including fixes from wireless, bpf, 2021-06-18 18:55:29 -07:00
nfc NFC: nfcmrvl: fix kernel-doc syntax in file headers 2021-05-23 17:26:38 -07:00
ntb
nubus
nvdimm include: remove pagemap.h from blkdev.h 2021-05-06 19:24:11 -07:00
nvme nvmet: fix freeing unallocated p2pmem 2021-06-02 10:10:38 +03:00
nvmem
of of: overlay: Remove redundant assignment to ret 2021-05-03 13:57:56 -05:00
opp
parisc
parport treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
pci Revert "PCI: PM: Do not read power state in pci_enable_device_flags()" 2021-06-22 17:35:18 +02:00
pcmcia
perf ARM: 2021-05-01 10:14:08 -07:00
phy phy: Sparx5 Eth SerDes: check return value after calling platform_get_resource() 2021-06-03 11:18:19 +05:30
pinctrl pinctrl: stm32: fix the reported number of GPIO lines per bank 2021-06-18 14:56:54 +02:00
platform platform/mellanox: mlxreg-hotplug: Revert "move to use request_irq by IRQF_NO_AUTOEN flag" 2021-06-04 22:03:13 +02:00
pnp
power power supply and reset changes for the v5.13 series 2021-04-28 15:43:58 -07:00
powercap
pps TTY/Serial driver updates for 5.13-rc1 2021-04-26 11:20:10 -07:00
ps3
ptp ptp: improve max_adj check against unreasonable values 2021-06-15 10:59:46 -07:00
pwm pwm: Changes for v5.13-rc1 2021-05-05 12:53:16 -07:00
rapidio rapidio: handle create_workqueue() failure 2021-05-13 18:32:19 +02:00
ras
regulator regulator: Fixes for v5.14 2021-06-08 09:41:16 -07:00
remoteproc remoteproc updates for v5.13 2021-05-04 11:13:33 -07:00
reset pci-v5.13-changes 2021-05-05 13:24:11 -07:00
rpmsg rpmsg: qcom_glink_native: fix error return code of qcom_glink_rx_data() 2021-04-09 11:08:42 -05:00
rtc RTC for 5.13 2021-05-03 12:15:21 -07:00
s390 s390/vfio-ap: clean up mdev resources when remove callback invoked 2021-06-21 11:19:18 +02:00
sbus
scsi SCSI fixes on 20210625 2021-06-25 15:59:14 -07:00
sh The usual updates from the irq departement: 2021-04-26 09:43:16 -07:00
siox
slimbus
soc soc: amlogic: meson-clk-measure: remove redundant dev_err call in meson_msr_probe() 2021-05-31 09:26:58 +02:00
soundwire soundwire: qcom: fix handling of qcom,ports-block-pack-mode 2021-05-13 11:14:13 +05:30
spi spi: spi-nxp-fspi: move the register operation after the clock enable 2021-06-14 15:02:01 +01:00
spmi
ssb
staging Networking fixes for 5.13-rc7, including fixes from wireless, bpf, 2021-06-18 18:55:29 -07:00
target scsi: target: core: Fix warning on realtime kernels 2021-05-31 22:59:13 -04:00
tc
tee OP-TEE use export_uuid() to copy UUID 2021-06-05 15:43:11 -07:00
thermal - Fix out-of-spec hardware (1st gen Hygon) which does not implement 2021-06-06 12:25:43 -07:00
thunderbolt thunderbolt: usb4: Fix NVM read buffer bounds and offset issue 2021-05-20 11:52:58 +03:00
tty serial: 8250_exar: Avoid NULL pointer dereference at ->exit() 2021-06-09 14:40:48 +02:00
uio uio_hv_generic: Fix another memory leak in error handling paths 2021-05-14 13:26:04 +02:00
usb usb: core: hub: Disable autosuspend for Cypress CY7C65632 2021-06-17 15:34:21 +02:00
vdpa {net,vdpa}/mlx5: Configure interface MAC into mpfs L2 table 2021-05-18 23:01:48 -07:00
vfio vfio/platform: fix module_put call in error flow 2021-05-24 13:40:13 -06:00
vhost virtio,vhost,vdpa: features, fixes 2021-05-05 13:31:39 -07:00
video Revert "fb_defio: Remove custom address_space_operations" 2021-06-01 17:38:40 +02:00
virt nitro_enclaves: Fix stale file descriptors on failed usercopy 2021-04-29 19:06:49 +02:00
virtio virtio_balloon: specify page reporting order if needed 2021-06-29 10:53:47 -07:00
visorbus
vlynq
vme
w1 w1: ds28e17: Use module_w1_family to simplify the code 2021-04-10 10:58:21 +02:00
watchdog - Core Frameworks 2021-04-28 15:59:13 -07:00
xen xen/events: reset active flag for lateeoi events later 2021-06-24 12:52:36 +02:00
zorro
Kconfig staging: comedi: move out of staging directory 2021-04-15 09:26:25 +02:00
Makefile virtio,vhost,vdpa: features, fixes 2021-05-05 13:31:39 -07:00