From cd20d48c69f9e586de914e1facf33b11122477ae Mon Sep 17 00:00:00 2001
From: Ivan Kruglov <mail@ikruglov.com>
Date: Wed, 19 Feb 2025 03:14:20 -0800
Subject: [PATCH] docs: clarify userns mapping when /proc/sys is rw

---
 docs/CONTAINER_INTERFACE.md | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/docs/CONTAINER_INTERFACE.md b/docs/CONTAINER_INTERFACE.md
index 6c9ea9a25ad..2a823218814 100644
--- a/docs/CONTAINER_INTERFACE.md
+++ b/docs/CONTAINER_INTERFACE.md
@@ -24,16 +24,19 @@ manager, please consider supporting the following interfaces.
    invoking systemd, and mount `/sys/`, `/sys/fs/selinux/` and `/proc/sys/`
    read-only (the latter via e.g. a read-only bind mount on itself) in order
    to prevent the container from altering the host kernel's configuration
-   settings. (As a special exception, if your container has network namespaces
+   settings. As a special exception, if your container has network namespaces
    enabled, feel free to make `/proc/sys/net/` writable. If it also has user, ipc,
-   uts and pid namespaces enabled, the entire `/proc/sys` can be left writable).
-   systemd and various other subsystems (such as the SELinux userspace) have
-   been modified to behave accordingly when these file systems are read-only.
-   (It's OK to mount `/sys/` as `tmpfs` btw, and only mount a subset of its
-   sub-trees from the real `sysfs` to hide `/sys/firmware/`, `/sys/kernel/` and
-   so on. If you do that, make sure to mark `/sys/` read-only, as that
-   condition is what systemd looks for, and is what is considered to be the API
-   in this context.)
+   uts and pid namespaces enabled, the entire `/proc/sys` can be left writable.
+   However, in the latter case, an appropriate userns mapping should exist to
+   map the root user inside the container to an unprivileged user on the
+   host. Otherwise, the root user inside the container could modify the host's
+   kernel settings. systemd and various other subsystems (such as the SELinux
+   userspace) have been modified to behave accordingly when these file systems
+   are read-only. (It's OK to mount `/sys/` as `tmpfs` btw, and only mount a
+   subset of its sub-trees from the real `sysfs` to hide `/sys/firmware/`,
+   `/sys/kernel/` and so on. If you do that, make sure to mark `/sys/`
+   read-only, as that condition is what systemd looks for, and is what is
+   considered to be the API in this context.)
 
 3. Pre-mount `/dev/` as (container private) `tmpfs` for the container and bind
    mount some suitable TTY to `/dev/console`. If this is a pty, make sure to
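
Review note, not part of the patch: the condition the new text adds (container root must map to an unprivileged host user before `/proc/sys` is left writable) could be checked by a container manager roughly as sketched below. This is a minimal illustration, not something the patch or systemd provides. The function names are hypothetical. It assumes the input is the content of `/proc/<pid>/uid_map` for the container's init process, read from the host's initial user namespace, where each line has the three fields "ID inside the namespace, ID outside, range length".

```python
from typing import List, Optional, Tuple

# One uid_map entry: (first ID inside the namespace, first ID outside, length).
Mapping = Tuple[int, int, int]


def parse_id_map(text: str) -> List[Mapping]:
    """Parse uid_map/gid_map content into (inside, outside, length) triples."""
    entries: List[Mapping] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(field) for field in fields)
        entries.append((inside, outside, length))
    return entries


def outside_id(entries: List[Mapping], inside_id: int = 0) -> Optional[int]:
    """Return the outside ID that inside_id maps to, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


def root_is_unprivileged_outside(uid_map_text: str) -> bool:
    """True only if container UID 0 is mapped to a nonzero outside UID.

    Hypothetical helper: an unmapped UID 0 is reported as False so that
    the caller falls back to the conservative (read-only) choice.
    """
    mapped = outside_id(parse_id_map(uid_map_text), 0)
    return mapped is not None and mapped != 0
```

A manager could feed this the text of the container payload's `uid_map` and keep `/proc/sys/` read-only whenever the function returns `False`, which covers both the identity mapping (`0 0 4294967295`) and the case where UID 0 is not mapped at all.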