systemd

mirror of https://github.com/systemd/systemd.git synced 2025-01-03 05:18:09 +03:00

Author	SHA1	Message	Date
Lennart Poettering	00a415fc8f	tree-wide: remove support for kernels lacking ambient caps Let's bump the kernel baseline a bit to 4.3 and thus require ambient caps. This allows us to remove support for a variety of special casing, most importantly the ExecStart=!! hack.	2024-12-17 17:34:46 +01:00
Yu Watanabe	e76fcd0e40	core: make ProtectHostname= optionally take a hostname Closes #35623.	2024-12-16 23:55:44 +09:00
Luca Boccassi	6dfd290031	core: Add PrivateUsers=full (#35183) Recently, PrivateUsers=identity was added to support mapping the first 65536 UIDs/GIDs from parent to the child namespace and mapping the other UID/GIDs to the nobody user. However, there are use cases where users have UIDs/GIDs > 65536 and need to do a similar identity mapping. Moreover, in some of those cases, users want a full identity mapping from 0 -> UID_MAX. To support this, we add PrivateUsers=full that does identity mapping for all available UID/GIDs. Note to differentiate ourselves from the init user namespace, we need to set up the uid_map/gid_map as two entries, `0 0 1` and `1 1 UINT32_MAX - 1`, as the init user namespace uses `0 0 UINT32_MAX` and some applications - like systemd itself - determine if it's a non-init user namespace based on uid_map/gid_map files. Note systemd will remove this heuristic in running_in_userns() in version 258 (https://github.com/systemd/systemd/pull/35382) and use the namespace inode instead. But some users may be running a container image with older systemd < 258 so we keep this hack until version 259 for version N-1 compatibility. In addition to mapping the whole UID/GID space, we also set /proc/pid/setgroups to "allow". While we usually set "deny" to avoid security issues with dropping supplementary groups (https://lwn.net/Articles/626665/), this ends up breaking dbus-broker when running /sbin/init in full OS containers. Fixes: #35168 Fixes: #35425	2024-12-13 12:25:13 +00:00
Ryan Wilson	2665425176	core: Set /proc/pid/setgroups to allow for PrivateUsers=full When trying to run dbus-broker in a systemd unit with PrivateUsers=full, we see dbus-broker fails with EPERM at `util_audit_drop_permissions`. The root cause is dbus-broker calls the setgroups() system call and this is disallowed via systemd's implementation of PrivateUsers= by setting /proc/pid/setgroups = deny. This is done to remediate potential privilege escalation vulnerabilities in user namespaces where an attacker can remove supplementary groups and gain access to resources where those groups are restricted. However, for OS-like containers, setgroups() is a pretty common API and disabling it is not feasible. So we allow setgroups() by setting /proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious users can still use SystemCallFilter= to disable setgroups() if they want to specifically prevent this system call. Fixes: #35425	2024-12-12 11:36:10 +00:00
Yu Watanabe	627d1a9ac1	core: Add ProtectHostname=private (#35447) This PR allows an option for systemd exec units to enable UTS namespaces but not restrict changing hostname via seccomp. Thus, units can change hostname without affecting the host. This is useful for OS-like containers running as units where they should have freedom to change their container hostname if they want, but not the host's hostname. Fixes: #30348	2024-12-11 10:17:25 +09:00
Ryan Wilson	219a6dbbf3	core: Fix time namespace in RestrictNamespaces= RestrictNamespaces= would accept "time" but would not actually apply seccomp filters, e.g. `systemd-run -p RestrictNamespaces=time unshare -T true` should fail but it succeeded. This commit actually enables time namespace seccomp filtering.	2024-12-10 20:55:26 +01:00
Ryan Wilson	cf48bde7ae	core: Add ProtectHostname=private This allows an option for systemd exec units to enable UTS namespaces but not restrict changing hostname via seccomp. Thus, units can change hostname without affecting the host. Fixes: #30348	2024-12-06 13:34:04 -08:00
Ryan Wilson	705cc82938	core: Add PrivateUsers=full Recently, PrivateUsers=identity was added to support mapping the first 65536 UIDs/GIDs from parent to the child namespace and mapping the other UID/GIDs to the nobody user. However, there are use cases where users have UIDs/GIDs > 65536 and need to do a similar identity mapping. Moreover, in some of those cases, users want a full identity mapping from 0 -> UID_MAX. Note to differentiate ourselves from the init user namespace, we need to set up the uid_map/gid_map as two entries, `0 0 1` and `1 1 UINT32_MAX - 1`, as the init user namespace uses `0 0 UINT32_MAX` and some applications - like systemd itself - determine if it's a non-init user namespace based on uid_map/gid_map files. Note systemd will remove this heuristic in running_in_userns() in version 258 and use the namespace inode instead. But some users may be running a container image with older systemd < 258 so we keep this hack until version 259. To support this, we add PrivateUsers=full that does identity mapping for all available UID/GIDs. Fixes: #35168	2024-12-05 10:34:32 -08:00
Septatrix	5857f31c2c	man: clarify wording regarding MONITOR_* envs	2024-12-06 03:01:19 +09:00
Zbigniew Jędrzejewski-Szmek	fe45f8dc9b	man: drop whitespace from final <programlisting> lines In the troff output, this doesn't seem to make any difference. But in the html output, the whitespace is sometimes preserved, creating an additional gap before the following content. Drop it everywhere to avoid this.	2024-11-08 14:14:36 +01:00
Lennart Poettering	b711737096	man: document that PrivateTmp= is unaffected by ProtectSystem=strict Fixes: #33130	2024-11-05 22:57:51 +01:00
Lennart Poettering	ecbe9ae5a0	man: don't claim SELinuxContext= only worked in the system service manager Fixes: #34840	2024-11-05 22:42:38 +01:00
Daan De Meyer	406f177501	core: Introduce PrivatePIDs= This new setting allows unsharing the pid namespace in a unit. Because you have to fork to get a process into a pid namespace, we fork in systemd-executor to get into the new pid namespace. The parent then sends the pid of the child process back to the manager and exits while the child process continues on with the rest of exec_invoke() and then executes the actual payload. Communicating the child pid is done via a new pidref socket pair that is set up on manager startup. We unshare the PID namespace right before the mount namespace so we mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes to mount procfs. When running unprivileged in a user session, user namespace is set up first to allow for PID namespace to be unshared. However, when running in privileged mode, we unshare the user namespace last to ensure the user namespace does not own the PID namespace and cannot break out of the sandbox. Note we disallow Type=forking services from using PrivatePIDs=yes since the init process inside the PID namespace must not exit for other processes in the namespace to exist. Note Daan De Meyer did the original work for this commit with Ryan Wilson addressing follow-ups. Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>	2024-11-05 05:32:02 -08:00
Andres Beltran	eae5127246	core: add id-mapped mount support for Exec directories	2024-11-01 18:45:28 +00:00
Luca Boccassi	890bdd1d77	core: add read-only flag for exec directories When an exec directory is shared between services, this allows one of the services to be the producer of files, and the other the consumer, without letting the consumer modify the shared files. This will be especially useful in conjunction with id-mapped exec directories so that fully sandboxed services can share directories in one direction, safely.	2024-11-01 10:46:55 +00:00
Ryan Wilson	cd58b5a135	cgroup: Add support for ProtectControlGroups= private and strict This commit adds two settings private and strict to the ProtectControlGroups= property. Private will unshare the cgroup namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup. Strict does the same except the mount is read-only. Since the unit is running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's own cgroup. We also add a new dbus property ProtectControlGroupsEx which accepts strings instead of boolean. This will allow users to use private/strict via dbus and systemd-run in addition to service files. Note private and strict fall back to no and yes respectively if the kernel doesn't support cgroup2 or system is not using unified hierarchy. Fixes: #34634	2024-10-28 08:37:36 -07:00
Yu Watanabe	edd3f4d9b7	core: drop implicit support of PrivateUsers=off Follow-up for `fa693fdc7e`. The documentation says the option takes a boolean or one of the "self" and "identity". But the parser uses private_users_from_string() which also accepts "off". Let's drop the implicit support of "off".	2024-10-09 05:39:54 +09:00
Jason Yundt	dfb3155419	man: document ShowStatus and SetShowStatus() SetShowStatus() was added in order to fix #11447. Recently, I ran into the exact same problem that OP was experiencing in #11447. I wasn’t able to figure out how to deal with the problem until I found #11447, and it took me a while to find #11447. This commit takes what I learned from reading #11447 and adds it to the documentation. Hopefully, this will make it easier for other people who run into the same problem in the future.	2024-09-18 10:11:55 +02:00
Daan De Meyer	fa693fdc7e	core: Add support for PrivateUsers=identity This configures an identity mapping similar to systemd-nspawn --private-users=identity.	2024-09-09 18:31:01 +02:00
Mike Yuan	7a9f0125bb	core: rename BindJournalSockets= to BindLogSockets= Addresses https://github.com/systemd/systemd/pull/32487#issuecomment-2328465309	2024-09-04 21:44:25 +02:00
Mike Yuan	368a3071e9	core: introduce BindJournalSockets= Closes #32478	2024-09-03 21:04:50 +02:00
Luca Boccassi	7d8bbfbe08	service: add 'debug' option to RestartMode= One of the major pain points of managing fleets of headless nodes is that when something fails at startup, unless debug level was already enabled (which it usually isn't, as it's a firehose), one needs to manually enable it and pray the issue can be reproduced, which often is really hard and time consuming, just to get extra info. Usually the extra log messages are enough to triage an issue. This new option makes it so that when a service fails and is restarted due to Restart=, log level for that unit is set to debug, so that all setup code in pid1 and sd-executor logs at debug level, and also a new DEBUG_INVOCATION=1 env var is passed to the service itself, so that it knows it should start with a higher log level. Once the unit succeeds or reaches the rate limit the original level is restored.	2024-08-27 12:24:45 +01:00
Daan De Meyer	831f208783	core: Add support for renaming credentials with ImportCredential= This allows for "per-instance" credentials for units. The use case is best explained with an example. Currently all our getty units have the following stanzas in their unit file: """ ImportCredential=agetty.* ImportCredential=login.* """ This means that setting agetty.autologin=root as a system credential will make every instance of all our getty units autologin as the root user. This prevents us from doing autologin on /dev/hvc0 while still requiring manual login on all other ttys. To solve the issue, we introduce support for renaming credentials with ImportCredential=. This will allow us to add the following to e.g. serial-getty@.service: """ ImportCredential=tty.serial.%I.agetty.:agetty. ImportCredential=tty.serial.%I.login.:login. """ which for serial-getty@hvc0.service will make the service manager read all credentials of the form "tty.serial.hvc0.agetty.xxx" and pass them to the service in the form "agetty.xxx" (same goes for login). We can apply the same to each of the getty units to allow setting agetty and login credentials for individual ttys instead of globally.	2024-07-31 15:52:27 +02:00
Lennart Poettering	c06b84d816	man: clarify what TTYReset= and TTYVTDisallocate= do and do not do regarding screen clearing	2024-07-19 11:44:04 +02:00
Mike Yuan	9d50d053f3	core: expose PrivateTmp=disconnected As discussed in https://github.com/systemd/systemd/pull/32724#discussion_r1638963071 I don't find the opposite reasoning particularly convincing. We have ProtectHome=tmpfs and friends, and those can be pretty much trivially implemented through TemporaryFileSystem= too. The new logic brings many benefits, and is completely generic, hence I see no reason not to expose it. We can even get more tests for the code path if we make it public.	2024-06-21 17:31:44 +02:00
Maximilian Wilhelm	163bb43cea	man/systemd.exec: list inaccessible files for ProtectKernelTunables	2024-06-20 03:00:59 +09:00
Luca Boccassi	0e551b04ef	core: do not imply PrivateTmp with DynamicUser, create a private tmpfs instead DynamicUser= enables PrivateTmp= implicitly to avoid files owned by reusable uids leaking into the host. Change it to create a fully private tmpfs instance instead, which also ensures the same result, since it has less impactful semantics compared to PrivateTmp=yes, which links the mount namespace to the host's /tmp instead. If a user specifies PrivateTmp manually, leave the existing behaviour unchanged to ensure backward compatibility is not broken.	2024-06-17 17:05:55 +01:00
Kamil Szczęk	608bfe76c1	core: populate $REMOTE_ADDR for AF_UNIX sockets Set the $REMOTE_ADDR environment variable for AF_UNIX socket connections when using per-connection socket activation (Accept=yes). $REMOTE_ADDR will now contain the remote socket's file system path (starting with a slash "/") or its address in the abstract namespace (starting with an at symbol "@"). This information is essential for identifying the remote peer in AF_UNIX socket connections, but it's not easy to obtain in a shell script for example without pulling in a ton of additional tools. By setting $REMOTE_ADDR, we make this information readily available to the activated service.	2024-06-12 00:11:10 +01:00
Mike Yuan	6b34871f5d	core/exec-credential: complain louder if inherited credential is missing Also document that a missing inherited credential is not considered fatal. Closes #32667	2024-05-07 22:02:42 +08:00
Mike Yuan	45a36ecff9	man/systemd.exec: mount_switch_root uses pivot_root rather than chroot	2024-04-27 14:28:54 +08:00
Guido Leenders	f445ed3c5f	Document effective owner of stdout/stderr log file upon creation The log files defined using file:, append: or truncate: inherit the owner and other privileges from the effective user running systemd. The log files are NOT created using the "User", "Group" or "UMask" defined in the service.	2024-04-22 20:46:25 +02:00
Yu Watanabe	6bd3102e3e	man: fix typo Follow-up for `fef46ffb5b`.	2024-04-23 01:42:11 +09:00
Lennart Poettering	fef46ffb5b	man: document that ReadOnlyPaths= doesn't affect ability to connect to AF_UNIX Fixes: #23470	2024-04-22 15:16:54 +02:00
Lennart Poettering	04366e0693	man: document that StateDirectory= trumps ProtectSystem=strict explicitly Fixes: #29798	2024-04-22 15:16:54 +02:00
Lennart Poettering	552dc4a97c	man: document explicitly that LogExtraFields= and LogFilterPatterns= are for system service only for now Fixes: #29956	2024-04-22 15:16:54 +02:00
Lennart Poettering	3c7f0d6b44	man: explicitly say that BindPaths=/BindReadOnlyPaths= opens a new mount namespace Fixes: #32339	2024-04-22 15:16:54 +02:00
Frantisek Sumsal	ad444dd8e8	man: slightly reword LogFilterPatterns= description As there was something missing in the existing sentence.	2024-04-15 17:16:18 +02:00
Ole Peder Brandtzæg	712514416e	man: remove PrivateMounts= from list of other settings in its own description The diff looks bigger, but that's only because it seemed fitting to reformat the paragraph now that the list is shorter.	2024-04-14 08:04:12 +09:00
Luca Boccassi	622efc544d	core: add support for vpick for ExtensionDirectories=	2024-02-17 11:20:00 +00:00
Luca Boccassi	5e79dd96a8	core: add support for vpick for ExtensionImages=	2024-02-17 11:20:00 +00:00
Luca Boccassi	7fa428cf44	man: create reusable snippet for 'vpick' entries	2024-02-17 11:20:00 +00:00
Winterhuman	6c6ec5f728	Improve IgnoreSIGPIPE description Reword the description of the `IgnoreSIGPIPE=` service option to be more grammatical.	2024-02-14 17:31:18 +00:00
Zbigniew Jędrzejewski-Szmek	f7862b2a00	tree-wide: use normal spelling of "reopen" It's a commonly used verb meaning "to open again".	2024-02-09 17:57:41 +01:00
Lennart Poettering	7704c3474d	man: document new user-scoped credentials	2024-01-30 17:07:47 +01:00
Lennart Poettering	7d93e4af80	man: document the new vpick concept	2024-01-03 18:38:46 +01:00
David Tardon	eea10b26f7	man: use same version in public and system ident.	2023-12-25 15:51:47 +01:00
David Tardon	13a69c120b	man: use <simplelist> for 'See also' sections This is just a slight markup improvement; there should be no difference in rendering.	2023-12-23 08:28:57 +01:00
Lennart Poettering	d1a5be82ef	core: imply SetLoginEnvironment= if PAMName= is set This generally makes sense as setting up a PAM session pretty much defines what a login session is. In context of #30547 this has the benefit that we can benefit from the SetLoginEnvironment= effect without having to set it explicitly, thus retaining some compat of the uid0 client towards older systemd service managers.	2023-12-21 10:14:21 +01:00
Lennart Poettering	b6be6a6721	man: document explicitly that ReadWritePaths= cannot undo superblock read-only settings Fixes: #29266	2023-11-09 09:39:12 +01:00
Luca Boccassi	00666ec71f	Merge pull request #6763 from kinvolk/iaguis/no-new-privs core: allow using seccomp without no_new_privs when unprivileged	2023-11-07 21:34:49 +00:00


641 Commits