From 56f47800d847deef0d3ffbeca5fd774e7819322b Mon Sep 17 00:00:00 2001
From: Benjamin Berg
Date: Thu, 23 Jul 2020 12:56:32 +0200
Subject: [PATCH 1/2] mount-setup: Enable memory_recursiveprot for cgroup2

When available, enable memory_recursiveprot. Realistically it always
makes sense to delegate MemoryLow= and MemoryMin= to all children of a
slice/unit.

The kernel option is not enabled by default as it might cause
regressions in some setups. However, it is the better default in
general, and it results in a more flexible and obvious behaviour.

The alternative to using this option would be for users to also set
DefaultMemoryLow= on slices when assigning MemoryLow=. However, this
makes the effect of MemoryLow= on some children less obvious, as it
could result in a lower protection rather than increasing it.

From the kernel documentation:

  memory_recursiveprot

        Recursively apply memory.min and memory.low protection to
        entire subtrees, without requiring explicit downward
        propagation into leaf cgroups. This allows protecting entire
        subtrees from one another, while retaining free competition
        within those subtrees. This should have been the default
        behavior but is a mount-option to avoid regressing setups
        relying on the original semantics (e.g. specifying bogusly
        high 'bypass' protection values at higher tree levels).

This was added in kernel commit 8a931f801340c (mm: memcontrol: recursive
memory.low protection), which became available in 5.7 and was
subsequently fixed in kernel 5.7.7 (mm: memcontrol: handle div0 crash
race condition in memory.low).
---
 src/core/mount-setup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/core/mount-setup.c b/src/core/mount-setup.c
index feb88f3e6e..ad14fd6aa9 100644
--- a/src/core/mount-setup.c
+++ b/src/core/mount-setup.c
@@ -85,6 +85,8 @@ static const MountPoint mount_table[] = {
 #endif
         { "tmpfs", "/run", "tmpfs", "mode=755" TMPFS_LIMITS_RUN, MS_NOSUID|MS_NODEV|MS_STRICTATIME,
           NULL, MNT_FATAL|MNT_IN_CONTAINER },
+        { "cgroup2", "/sys/fs/cgroup", "cgroup2", "nsdelegate,memory_recursiveprot", MS_NOSUID|MS_NOEXEC|MS_NODEV,
+          cg_is_unified_wanted, MNT_IN_CONTAINER|MNT_CHECK_WRITABLE },
         { "cgroup2", "/sys/fs/cgroup", "cgroup2", "nsdelegate", MS_NOSUID|MS_NOEXEC|MS_NODEV,
           cg_is_unified_wanted, MNT_IN_CONTAINER|MNT_CHECK_WRITABLE },
         { "cgroup2", "/sys/fs/cgroup", "cgroup2", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV,

From 29bb3d7fc4e6111fa35957326e3a62004d68f2a6 Mon Sep 17 00:00:00 2001
From: Benjamin Berg
Date: Fri, 24 Jul 2020 13:17:23 +0200
Subject: [PATCH 2/2] man: Improve MemoryMin=/MemoryLow= description

The description didn't really explain how the distribution mechanism
works exactly and the relationship of leaf and slice units. Update the
documentation and also explicitly explain the expected behaviour as it
is created by the memory_recursiveprot cgroup2 mount option.
---
 man/systemd.resource-control.xml | 57 +++++++++++++-------------------
 1 file changed, 23 insertions(+), 34 deletions(-)

diff --git a/man/systemd.resource-control.xml b/man/systemd.resource-control.xml
index 3ccb5c4927..744a5f98ce 100644
--- a/man/systemd.resource-control.xml
+++ b/man/systemd.resource-control.xml
@@ -261,53 +261,42 @@
- MemoryMin=bytes
+ MemoryMin=bytes, MemoryLow=bytes
- Specify the memory usage protection of the executed processes in this unit. If the memory usages of
- this unit and all its ancestors are below their minimum boundaries, this unit's memory won't be reclaimed.
+ Specify the memory usage protection of the executed processes in this unit.
+ When reclaiming memory, the unit is treated as if it was using less memory resulting in memory
+ to be preferentially reclaimed from unprotected units.
+ Using MemoryLow= results in a weaker protection where memory may still
+ be reclaimed to avoid invoking the OOM killer in case there is no other reclaimable memory.
+
+ For a protection to be effective, it is generally required to set a corresponding
+ allocation on all ancestors, which is then distributed between children
+ (with the exception of the root slice).
+ Any MemoryMin= or MemoryLow= allocation that is not
+ explicitly distributed to specific children is used to create a shared protection for all children.
+ As this is a shared protection, the children will freely compete for the memory.
  Takes a memory size in bytes. If the value is suffixed with K, M, G or T, the specified memory size is
  parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes (with the base 1024), respectively. Alternatively, a
  percentage value may be specified, which is taken relative to the installed physical memory on the
  system. If assigned the special value infinity, all available memory is protected, which may be
  useful in order to always inherit all of the protection afforded by ancestors.
- This controls the memory.min control group attribute. For details about this
- control group attribute, see memory.min or memory.low control group attribute.
+ For details about this control group attribute, see Memory Interface Files.
  This setting is supported only if the unified control group hierarchy is used and disables
  MemoryLimit=.
- Units may have their children use a default memory.min value by specifying
- DefaultMemoryMin=, which has the same semantics as MemoryMin=. This setting
- does not affect memory.min in the unit itself.
-
-
-
-
- MemoryLow=bytes
-
-
- Specify the best-effort memory usage protection of the executed processes in this unit.
If the memory
- usages of this unit and all its ancestors are below their low boundaries, this unit's memory won't be
- reclaimed as long as memory can be reclaimed from unprotected units.
-
- Takes a memory size in bytes. If the value is suffixed with K, M, G or T, the specified memory size is
- parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes (with the base 1024), respectively. Alternatively, a
- percentage value may be specified, which is taken relative to the installed physical memory on the
- system. If assigned the special value infinity, all available memory is protected, which may be
- useful in order to always inherit all of the protection afforded by ancestors.
- This controls the memory.low control group attribute. For details about this
- control group attribute, see Memory Interface Files.
-
- This setting is supported only if the unified control group hierarchy is used and disables
- MemoryLimit=.
-
- Units may have their children use a default memory.low value by specifying
- DefaultMemoryLow=, which has the same semantics as MemoryLow=. This setting
- does not affect memory.low in the unit itself.
+ Units may have their children use a default memory.min or
+ memory.low value by specifying DefaultMemoryMin= or
+ DefaultMemoryLow=, which has the same semantics as
+ MemoryMin= and MemoryLow=.
+ This setting does not affect memory.min or memory.low
+ in the unit itself.
+ Using it to set a default child allocation is only useful on kernels older than 5.7,
+ which do not support the memory_recursiveprot cgroup2 mount option.
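
Review note: since the documented behaviour depends on whether memory_recursiveprot actually took effect, it can help to check the options of the mounted cgroup2 hierarchy. The following is a minimal sketch, not part of this series: has_mount_option() is a hypothetical helper (not a systemd function) that tests whether a comma-separated mount option string contains a given entry. It assumes the option string is taken from the options field of the /sys/fs/cgroup entry in /proc/mounts, and that the kernel lists memory_recursiveprot there when the option is active.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return true if the comma-separated option list
 * `opts` contains `name` as a complete entry (not merely as a substring). */
static bool has_mount_option(const char *opts, const char *name) {
        size_t n = strlen(name);

        if (n == 0)
                return false;

        for (const char *p = opts; *p != '\0'; ) {
                /* Length of the current entry, up to the next comma or the end. */
                size_t len = strcspn(p, ",");

                if (len == n && strncmp(p, name, n) == 0)
                        return true;

                /* Skip the entry and the separating comma, if any. */
                p += len;
                if (*p == ',')
                        p++;
        }

        return false;
}
```

For example, has_mount_option("rw,nosuid,nodev,noexec,nsdelegate,memory_recursiveprot", "memory_recursiveprot") returns true, while an option string that only contains nsdelegate would indicate that the fallback mount_table entry was used.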