1
0
mirror of https://github.com/systemd/systemd.git synced 2025-01-03 05:18:09 +03:00
Commit Graph

641 Commits

Author SHA1 Message Date
Lennart Poettering
00a415fc8f tree-wide: remove support for kernels lacking ambient caps
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.

This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
2024-12-17 17:34:46 +01:00
Yu Watanabe
e76fcd0e40 core: make ProtectHostname= optionally take a hostname
Closes #35623.
2024-12-16 23:55:44 +09:00
Luca Boccassi
6dfd290031
core: Add PrivateUsers=full (#35183)
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namedspace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine if its a non-init user namespace based
on uid_map/gid_map files.

Note systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and uses
namespace inode. But some users may be running a container image with
older systemd < 258 so we keep this hack until version 259 for version
N-1 compatibility.

In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.

Fixes: #35168
Fixes: #35425
2024-12-13 12:25:13 +00:00
Ryan Wilson
2665425176 core: Set /proc/pid/setgroups to allow for PrivateUsers=full
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.

The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.

However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.

Fixes: #35425
2024-12-12 11:36:10 +00:00
Yu Watanabe
627d1a9ac1
core: Add ProtectHostname=private (#35447)
This PR allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host. This is useful for OS-like
containers running as units where they should have freedom to change
their container hostname if they want, but not the host's hostname.

Fixes: #30348
2024-12-11 10:17:25 +09:00
Ryan Wilson
219a6dbbf3 core: Fix time namespace in RestrictNamespaces=
RestrictNamespaces= would accept "time" but would not actually apply
seccomp filters e.g. systemd-run -p RestrictNamespaces=time unshare -T true
should fail but it succeeded.

This commit actually enables time namespace seccomp filtering.
2024-12-10 20:55:26 +01:00
Ryan Wilson
cf48bde7ae core: Add ProtectHostname=private
This allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host.

Fixes: #30348
2024-12-06 13:34:04 -08:00
Ryan Wilson
705cc82938 core: Add PrivateUsers=full
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases, users
want a full identity mapping from 0 -> UID_MAX.

Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namedspace uses `0 0 UINT32_MAX` and some applications -
like systemd itself - determine if its a non-init user namespace based on
uid_map/gid_map files. Note systemd will remove this heuristic in
running_in_userns() in version 258 and uses namespace inode. But some users
may be running a container image with older systemd < 258 so we keep this
hack until version 259.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Fixes: #35168
2024-12-05 10:34:32 -08:00
Septatrix
5857f31c2c man: clarify wording regarding MONITOR_* envs 2024-12-06 03:01:19 +09:00
Zbigniew Jędrzejewski-Szmek
fe45f8dc9b man: drop whitespace from final <programlisting> lines
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.
2024-11-08 14:14:36 +01:00
Lennart Poettering
b711737096 man: document that PrivateTmp= is unaffected by ProtectSystem=strict
Fixes: #33130
2024-11-05 22:57:51 +01:00
Lennart Poettering
ecbe9ae5a0 man: don't claim SELinuxContext= only worked in the system service manager
Fixes: #34840
2024-11-05 22:42:38 +01:00
Daan De Meyer
406f177501 core: Introduce PrivatePIDs=
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.

Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.

We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.

When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.

Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.

Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.

Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
2024-11-05 05:32:02 -08:00
Andres Beltran
eae5127246 core: add id-mapped mount support for Exec directories 2024-11-01 18:45:28 +00:00
Luca Boccassi
890bdd1d77 core: add read-only flag for exec directories
When an exec directory is shared between services, this allows one of the
service to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
2024-11-01 10:46:55 +00:00
Ryan Wilson
cd58b5a135 cgroup: Add support for ProtectControlGroups= private and strict
This commit adds two settings private and strict to
the ProtectControlGroups= property. Private will unshare the cgroup
namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup.
Strict does the same except the mount is read-only. Since the unit is
running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's
own cgroup.

We also add a new dbus property ProtectControlGroupsEx which accepts strings
instead of boolean. This will allow users to use private/strict via dbus
and systemd-run in addition to service files.

Note private and strict fall back to no and yes respectively if the kernel
doesn't support cgroup2 or system is not using unified hierarchy.

Fixes: #34634
2024-10-28 08:37:36 -07:00
Yu Watanabe
edd3f4d9b7 core: drop implicit support of PrivateUsers=off
Follow-up for fa693fdc7e.

The documentation says the option takes a boolean or one of the "self"
and "identity". But the parser uses private_users_from_string() which
also accepts "off". Let's drop the implicit support of "off".
2024-10-09 05:39:54 +09:00
Jason Yundt
dfb3155419 man: document ShowStatus and SetShowStatus()
SetShowStatus() was added in order to fix #11447. Recently, I ran into
the exact same problem that OP was experiencing in #11447. I wasn’t able
to figure out how to deal with the problem until I found #11447, and it
took me a while to find #11447.

This commit takes what I learned from reading #11447 and adds it to the
documentation. Hopefully, this will make it easier for other people who
run into the same problem in the future.
2024-09-18 10:11:55 +02:00
Daan De Meyer
fa693fdc7e core: Add support for PrivateUsers=identity
This configures an indentity mapping similar to
systemd-nspawn --private-users=identity.
2024-09-09 18:31:01 +02:00
Mike Yuan
7a9f0125bb
core: rename BindJournalSockets= to BindLogSockets=
Addresses https://github.com/systemd/systemd/pull/32487#issuecomment-2328465309
2024-09-04 21:44:25 +02:00
Mike Yuan
368a3071e9
core: introduce BindJournalSockets=
Closes #32478
2024-09-03 21:04:50 +02:00
Luca Boccassi
7d8bbfbe08 service: add 'debug' option to RestartMode=
One of the major pait points of managing fleets of headless nodes is
that when something fails at startup, unless debug level was already
enabled (which usually isn't, as it's a firehose), one needs to manually
enable it and pray the issue can be reproduced, which often is really
hard and time consuming, just to get extra info. Usually the extra log
messages are enough to triage an issue.

This new option makes it so that when a service fails and is restarted
due to Restart=, log level for that unit is set to debug, so that all
setup code in pid1 and sd-executor logs at debug level, and also a new
DEBUG_INVOCATION=1 env var is passed to the service itself, so that it
knows it should start with a higher log level. Once the unit succeeds
or reaches the rate limit the original level is restored.
2024-08-27 12:24:45 +01:00
Daan De Meyer
831f208783 core: Add support for renaming credentials with ImportCredential=
This allows for "per-instance" credentials for units. The use case
is best explained with an example. Currently all our getty units
have the following stanzas in their unit file:

"""
ImportCredential=agetty.*
ImportCredential=login.*
"""

This means that setting agetty.autologin=root as a system credential
will make every instance of our all our getty units autologin as the
root user. This prevents us from doing autologin on /dev/hvc0 while
still requiring manual login on all other ttys.

To solve the issue, we introduce support for renaming credentials with
ImportCredential=. This will allow us to add the following to e.g.
serial-getty@.service:

"""
ImportCredential=tty.serial.%I.agetty.*:agetty.
ImportCredential=tty.serial.%I.login.*:login.
"""

which for serial-getty@hvc0.service will make the service manager read
all credentials of the form "tty.serial.hvc0.agetty.xxx" and pass them
to the service in the form "agetty.xxx" (same goes for login). We can
apply the same to each of the getty units to allow setting agetty and
login credentials for individual ttys instead of globally.
2024-07-31 15:52:27 +02:00
Lennart Poettering
c06b84d816 man: clarify what TTYReset= and TTYVTDisallocate= do and do not do regarding screen clearing 2024-07-19 11:44:04 +02:00
Mike Yuan
9d50d053f3
core: expose PrivateTmp=disconnected
As discussed in https://github.com/systemd/systemd/pull/32724#discussion_r1638963071

I don't find the opposite reasoning particularly convincing.
We have ProtectHome=tmpfs and friends, and those can be
pretty much trivially implemented through TemporaryFileSystem=
too. The new logic brings many benefits, and is completely generic,
hence I see no reason not to expose it. We can even get more tests
for the code path if we make it public.
2024-06-21 17:31:44 +02:00
Maximilian Wilhelm
163bb43cea man/systemd.exec: list inaccessible files for ProtectKernelTunables 2024-06-20 03:00:59 +09:00
Luca Boccassi
0e551b04ef core: do not imply PrivateTmp with DynamicUser, create a private tmpfs instead
DynamicUser= enables PrivateTmp= implicitly to avoid files owned by reusable uids
leaking into the host. Change it to instead create a fully private tmpfs instance
instead, which also ensures the same result, since it has less impactful semantics
with respect to PrivateTmp=yes, which links the mount namespace to the host's /tmp
instead. If a user specifies PrivateTmp manually, let the existing behaviour
unchanged to ensure backward compatibility is not broken.
2024-06-17 17:05:55 +01:00
Kamil Szczęk
608bfe76c1 core: populate $REMOTE_ADDR for AF_UNIX sockets
Set the $REMOTE_ADDR environment variable for AF_UNIX socket connections
when using per-connection socket activation (Accept=yes). $REMOTE_ADDR
will now contain the remote socket's file system path (starting with a
slash "/") or its address in the abstract namespace (starting with an
at symbol "@").

This information is essential for identifying the remote peer in AF_UNIX
socket connections, but it's not easy to obtain in a shell script for
example without pulling in a ton of additional tools. By setting
$REMOTE_ADDR, we make this information readily available to the
activated service.
2024-06-12 00:11:10 +01:00
Mike Yuan
6b34871f5d
core/exec-credential: complain louder if inherited credential is missing
Also document that a missing inherited credential
is not considered fatal.

Closes #32667
2024-05-07 22:02:42 +08:00
Mike Yuan
45a36ecff9
man/systemd.exec: mount_switch_root uses pivot_root rather than chroot 2024-04-27 14:28:54 +08:00
Guido Leenders
f445ed3c5f Document effective owner of stdout/stderr log file upon creation
The log files defined using file:, append: or truncate: inherit the owner and other privileges from the effective user running systemd.

The log files are NOT created using the "User", "Group" or "UMask" defined in the service.
2024-04-22 20:46:25 +02:00
Yu Watanabe
6bd3102e3e man: fix typo
Follow-up for fef46ffb5b.
2024-04-23 01:42:11 +09:00
Lennart Poettering
fef46ffb5b man: document that ReadOnlyPaths= doesn't affect ability to connect to AF_UNIX
Fixes: #23470
2024-04-22 15:16:54 +02:00
Lennart Poettering
04366e0693 man: document that StateDirectory= trumps ProtectSystem=strict explicitly
Fixes: #29798
2024-04-22 15:16:54 +02:00
Lennart Poettering
552dc4a97c man: document explicitly that LogExtraFields= and LogFilterPatterns= are for system service only for now
Fixes: #29956
2024-04-22 15:16:54 +02:00
Lennart Poettering
3c7f0d6b44 man: explicitly say that BindPaths=/BindReadOnlyPaths= opens a new mount
namespace

Fixes: #32339
2024-04-22 15:16:54 +02:00
Frantisek Sumsal
ad444dd8e8 man: slightly reword LogFilterPatterns= description
As there was something missing in the existing sentence.
2024-04-15 17:16:18 +02:00
Ole Peder Brandtzæg
712514416e man: remove PrivateMounts= from list of other settings in its own description
The diff looks bigger, but that's only because it seemed fitting to
reformat the paragraph now that the list is shorter.
2024-04-14 08:04:12 +09:00
Luca Boccassi
622efc544d core: add support for vpick for ExtensionDirectories= 2024-02-17 11:20:00 +00:00
Luca Boccassi
5e79dd96a8 core: add support for vpick for ExtensionImages= 2024-02-17 11:20:00 +00:00
Luca Boccassi
7fa428cf44 man: create reusable snippet for 'vpick' entries 2024-02-17 11:20:00 +00:00
Winterhuman
6c6ec5f728 Improve IgnoreSIGPIPE description
Reword the description of the `IgnoreSIGPIPE=` service option to be more grammatical.
2024-02-14 17:31:18 +00:00
Zbigniew Jędrzejewski-Szmek
f7862b2a00 tree-wide: use normal spelling of "reopen"
It's a commonly used verb meaning "to open again".
2024-02-09 17:57:41 +01:00
Lennart Poettering
7704c3474d man: document new user-scoped credentials 2024-01-30 17:07:47 +01:00
Lennart Poettering
7d93e4af80 man: document the new vpick concept 2024-01-03 18:38:46 +01:00
David Tardon
eea10b26f7 man: use same version in public and system ident. 2023-12-25 15:51:47 +01:00
David Tardon
13a69c120b man: use <simplelist> for 'See also' sections
This is just a slight markup improvement; there should be no difference
in rendering.
2023-12-23 08:28:57 +01:00
Lennart Poettering
d1a5be82ef core: imply SetLoginEnvironment= if PAMName= is set
This geneally makes sense as setting up a PAM session pretty much
defines what a login session is.

In context of #30547 this has the benefit that we can take benefit of
the SetLoginEnvironment= effect without having to set it explicitly,
thus retaining some compat of the uid0 client towards older systemd
service managers.
2023-12-21 10:14:21 +01:00
Lennart Poettering
b6be6a6721 man: document explicitly tha ReadWritePaths= cannot undo superblock read-only settings
Fixes: #29266
2023-11-09 09:39:12 +01:00
Luca Boccassi
00666ec71f
Merge pull request #6763 from kinvolk/iaguis/no-new-privs
core: allow using seccomp without no_new_privs when unprivileged
2023-11-07 21:34:49 +00:00