From cd20d48c69f9e586de914e1facf33b11122477ae Mon Sep 17 00:00:00 2001
From: Ivan Kruglov <mail@ikruglov.com>
Date: Wed, 19 Feb 2025 03:14:20 -0800
Subject: [PATCH] docs: clarify userns mapping when /proc/sys is rw

---
 docs/CONTAINER_INTERFACE.md | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/docs/CONTAINER_INTERFACE.md b/docs/CONTAINER_INTERFACE.md
index 6c9ea9a25ad..2a823218814 100644
--- a/docs/CONTAINER_INTERFACE.md
+++ b/docs/CONTAINER_INTERFACE.md
@@ -24,16 +24,19 @@ manager, please consider supporting the following interfaces.
    invoking systemd, and mount `/sys/`, `/sys/fs/selinux/` and `/proc/sys/`
    read-only (the latter via e.g. a read-only bind mount on itself) in order
    to prevent the container from altering the host kernel's configuration
-   settings. (As a special exception, if your container has network namespaces
+   settings. As a special exception, if your container has network namespaces
    enabled, feel free to make `/proc/sys/net/` writable. If it also has user, ipc,
-   uts and pid namespaces enabled, the entire `/proc/sys` can be left writable).
-   systemd and various other subsystems (such as the SELinux userspace) have
-   been modified to behave accordingly when these file systems are read-only.
-   (It's OK to mount `/sys/` as `tmpfs` btw, and only mount a subset of its
-   sub-trees from the real `sysfs` to hide `/sys/firmware/`, `/sys/kernel/` and
-   so on. If you do that, make sure to mark `/sys/` read-only, as that
-   condition is what systemd looks for, and is what is considered to be the API
-   in this context.)
+   uts and pid namespaces enabled, the entire `/proc/sys` can be left writable.
+   However, in the latter case, the user namespace mapping should map the
+   root user inside the container to an unprivileged user on the host.
+   Otherwise, the root user inside the container could modify the host's
+   kernel settings. systemd and various other subsystems (such as the SELinux
+   userspace) have been modified to behave accordingly when these file systems
+   are read-only. (It's OK to mount `/sys/` as `tmpfs` btw, and only mount a
+   subset of its sub-trees from the real `sysfs` to hide `/sys/firmware/`,
+   `/sys/kernel/` and so on. If you do that, make sure to mark `/sys/`
+   read-only, as that condition is what systemd looks for, and is what is
+   considered to be the API in this context.)
 
 3. Pre-mount `/dev/` as (container private) `tmpfs` for the container and bind
    mount some suitable TTY to `/dev/console`. If this is a pty, make sure to