1
0
mirror of https://github.com/systemd/systemd.git synced 2025-01-06 17:18:12 +03:00
Commit Graph

212 Commits

Author SHA1 Message Date
Yu Watanabe
e76fcd0e40 core: make ProtectHostname= optionally take a hostname
Closes #35623.
2024-12-16 23:55:44 +09:00
Ryan Wilson
6746f28854 core: Migrate ProtectHostname to use enum vs boolean
Migrating ProtectHostname to enum will set the stage for adding more
properties like ProtectHostname=private in future commits.

In addition, we add PrivateHostnameEx property to dbus API which uses
string instead of boolean.
2024-12-06 13:33:49 -08:00
Štěpán Němec
597c6cc119 man: fix incorrect volume numbers in internal man page references
Some ambiguity (e.g., same-named man pages in multiple volumes)
makes it impossible to fully automate this, but the following
Python snippet (run inside the man/ directory of the systemd repo)
helped to generate the sed command lines (which were subsequently
manually reviewed, run and the false positives reverted):

from pathlib import Path

import lxml
from lxml import etree as ET

man2vol: dict[str, str] = {}
man2citerefs: dict[str, list] = {}

for file in Path(".").glob("*.xml"):
    tree = ET.parse(file, lxml.etree.XMLParser(recover=True))
    meta = tree.find("refmeta")
    if meta is not None:
        title = meta.findtext("refentrytitle")
        if title is not None:
            vol = meta.findtext("manvolnum")
            if vol is not None:
                man2vol[title] = vol
            citerefs = list(tree.iter("citerefentry"))
            if citerefs:
                man2citerefs[title] = citerefs

for man, refs in man2citerefs.items():
    for ref in refs:
        title = ref.findtext("refentrytitle")
        if title is not None:
            has = ref.findtext("manvolnum")
            try:
                should_have = man2vol[title]
            except KeyError:  # Non-systemd man page reference?  Ignore.
                continue
            if has != should_have:
                print(
                    f"sed -i '\\|<citerefentry><refentrytitle>{title}"
                    f"</refentrytitle><manvolnum>{has}</manvolnum>"
                    f"</citerefentry>|s|<manvolnum>{has}</manvolnum>|"
                    f"<manvolnum>{should_have}</manvolnum>|' {man}.xml"
                )
2024-11-11 20:31:08 +01:00
Zbigniew Jędrzejewski-Szmek
fe45f8dc9b man: drop whitespace from final <programlisting> lines
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.
2024-11-08 14:14:36 +01:00
Lennart Poettering
607d297487 man: link up D-Bus API docs from daemon man pages
Let's systematically make sure that we link up the D-Bus interfaces from
the daemon man pages once in prose and once in short form at the bottom
("See Also"), for all daemons.

Also, add reverse links at the bottom of the D-Bus API docs.

Fixes: #34996
2024-11-05 22:57:51 +01:00
Daan De Meyer
406f177501 core: Introduce PrivatePIDs=
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.

Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.

We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.

When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.

Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.

Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.

Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
2024-11-05 05:32:02 -08:00
Luca Boccassi
890bdd1d77 core: add read-only flag for exec directories
When an exec directory is shared between services, this allows one of the
service to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
2024-11-01 10:46:55 +00:00
Ryan Wilson
cd58b5a135 cgroup: Add support for ProtectControlGroups= private and strict
This commit adds two settings private and strict to
the ProtectControlGroups= property. Private will unshare the cgroup
namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup.
Strict does the same except the mount is read-only. Since the unit is
running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's
own cgroup.

We also add a new dbus property ProtectControlGroupsEx which accepts strings
instead of boolean. This will allow users to use private/strict via dbus
and systemd-run in addition to service files.

Note private and strict fall back to no and yes respectively if the kernel
doesn't support cgroup2 or system is not using unified hierarchy.

Fixes: #34634
2024-10-28 08:37:36 -07:00
Mike Yuan
7e40b51a2e
man/org.freedesktop.systemd1: complete version info for ManagedOOMMemoryPressureDurationUSec
Follow-up for 63d4c4271c

Some unit types were left out.
2024-10-22 19:12:27 +02:00
Ryan Wilson
63d4c4271c cgroup: Add ManagedOOMMemoryPressureDurationSec= override setting for units
This will allow units (scopes/slices/services) to override the default
systemd-oomd setting DefaultMemoryPressureDurationSec=.

The semantics of ManagedOOMMemoryPressureDurationSec= are:
- If >= 1 second, overrides DefaultMemoryPressureDurationSec= from oomd.conf
- If is empty, uses DefaultMemoryPressureDurationSec= from oomd.conf
- Ignored if ManagedOOMMemoryPressure= is not "kill"
- Disallowed if < 1 second

Note the corresponding dbus property is DefaultMemoryPressureDurationUSec
which is in microseconds. This is consistent with other time-based
dbus properties.
2024-10-16 20:12:38 -07:00
Arthur Shau
cc0ab8c810 timer: introduce DeferReactivation setting
By default, in instances where timers are running on a realtime schedule,
if a service takes longer to run than the interval of a timer, the
service will immediately start again when the previous invocation finishes.
This is caused by the fact that the next elapse is calculated based on
the last trigger time, which, combined with the fact that the interval
is shorter than the runtime of the service, causes that elapse to be in
the past, which in turn means the timer will trigger as soon as the
service finishes running.

This behavior can be changed by enabling the new DeferReactivation setting,
which will cause the next calendar elapse to be calculated based on when
the trigger unit enters inactivity, rather than the last trigger time.

Thus, if a timer is on an realtime interval, the trigger will always
adhere to that specified interval.
E.g. if you have a timer that runs on a minutely interval, the setting
guarantees that triggers will happen at *:*:00 times, whereas by default
this may skew depending on how long the service runs.

Co-authored-by: Matteo Croce <teknoraver@meta.com>
2024-10-11 22:54:16 +02:00
Lennart Poettering
0aaacc3a10
Merge pull request #34593 from Werkov/deprecate-aux-scopes
core/manager: Deprecate StartAuxiliaryScope() method
2024-10-09 10:25:30 +02:00
Michal Koutný
64f173324e core/manager: Deprecate StartAuxiliaryScope() method
The method was added with migration of resources in mind (e.g. process's
allocated memory will follow it to the new scope), however, such a
resource migration is not in cgroup semantics. The method may thus have
the intended users and others could be guided to StartTransientUnit().

Since this API was advertised in a regular release, start the removal
with a deprecation message to callers.
Eventually, the goal is to remove the method to clean up DBus API and
simplify code (removal of cgroup_context_copy()).

Part of DBus docs is retained to satisfy build checks.
2024-10-08 17:49:13 +02:00
Ryan Wilson
3543456f84 Add ExtraFileDescriptor property to StartTransientUnit dbus API
This adds the ExtraFileDescriptor property to StartTransient dbus API
with format "a(hs)" - array of (file descriptor, name) pairs. The FD
will be passed to the unit via sd_notify like Socket and OpenFile.

systemctl show also shows ExtraFileDescriptorName for these transient
units. We only show the name passed to dbus as the FD numbers will
change once passed over the unix socket and are duplicated, so its
confusing to display the numbers.

We do not add this functionality for systemd-run or general systemd
service units as it is not useful for general systemd services.
Arguably, it could be useful for systemd-run in bash scripts but we
prefer to be cautious and not expose the API yet.

Fixes: #34396
2024-10-07 09:01:48 -07:00
Luca Boccassi
3509fe124d man: consolidate list of active unit states into a shared table
Avoids the need to maintain the same list over and over again, and
link it to the defition table in the implementation as a reminder
too
2024-10-04 12:22:55 +02:00
Jason Yundt
dfb3155419 man: document ShowStatus and SetShowStatus()
SetShowStatus() was added in order to fix #11447. Recently, I ran into
the exact same problem that OP was experiencing in #11447. I wasn’t able
to figure out how to deal with the problem until I found #11447, and it
took me a while to find #11447.

This commit takes what I learned from reading #11447 and adds it to the
documentation. Hopefully, this will make it easier for other people who
run into the same problem in the future.
2024-09-18 10:11:55 +02:00
Daan De Meyer
fa693fdc7e core: Add support for PrivateUsers=identity
This configures an indentity mapping similar to
systemd-nspawn --private-users=identity.
2024-09-09 18:31:01 +02:00
Takeo Kondo
71f43fc882 man: CAP_SYS_ADMIN does NOT grant any permission for dbus API 2024-09-06 21:16:53 +02:00
Mike Yuan
7a9f0125bb
core: rename BindJournalSockets= to BindLogSockets=
Addresses https://github.com/systemd/systemd/pull/32487#issuecomment-2328465309
2024-09-04 21:44:25 +02:00
Mike Yuan
368a3071e9
core: introduce BindJournalSockets=
Closes #32478
2024-09-03 21:04:50 +02:00
Luca Boccassi
5162829ec8 core: do BindMount/MountImage operations in async control process
These operations might require slow I/O, and thus might block PID1's main
loop for an undeterminated amount of time. Instead of performing them
inline, fork a worker process and stash away the D-Bus message, and reply
once we get a SIGCHILD indicating they have completed. That way we don't
break compatibility and callers can continue to rely on the fact that when
they get the method reply the operation either succeeded or failed.

To keep backward compatibility, unlike reload control processes, these
are ran inside init.scope and not the target cgroup. Unlike ExecReload,
this is under our control and is not defined by the unit. This is necessary
because previously the operation also wasn't ran from the target cgroup,
so suddenly forking a copy-on-write copy of pid1 into the target cgroup
will make memory usage spike, and if there is a MemoryMax= or MemoryHigh=
set and the cgroup is already close to the limit, it will cause an OOM
kill, where previously it would have worked fine.
2024-08-29 12:48:55 +01:00
Luca Boccassi
7d8bbfbe08 service: add 'debug' option to RestartMode=
One of the major pait points of managing fleets of headless nodes is
that when something fails at startup, unless debug level was already
enabled (which usually isn't, as it's a firehose), one needs to manually
enable it and pray the issue can be reproduced, which often is really
hard and time consuming, just to get extra info. Usually the extra log
messages are enough to triage an issue.

This new option makes it so that when a service fails and is restarted
due to Restart=, log level for that unit is set to debug, so that all
setup code in pid1 and sd-executor logs at debug level, and also a new
DEBUG_INVOCATION=1 env var is passed to the service itself, so that it
knows it should start with a higher log level. Once the unit succeeds
or reaches the rate limit the original level is restored.
2024-08-27 12:24:45 +01:00
Daan De Meyer
831f208783 core: Add support for renaming credentials with ImportCredential=
This allows for "per-instance" credentials for units. The use case
is best explained with an example. Currently all our getty units
have the following stanzas in their unit file:

"""
ImportCredential=agetty.*
ImportCredential=login.*
"""

This means that setting agetty.autologin=root as a system credential
will make every instance of our all our getty units autologin as the
root user. This prevents us from doing autologin on /dev/hvc0 while
still requiring manual login on all other ttys.

To solve the issue, we introduce support for renaming credentials with
ImportCredential=. This will allow us to add the following to e.g.
serial-getty@.service:

"""
ImportCredential=tty.serial.%I.agetty.*:agetty.
ImportCredential=tty.serial.%I.login.*:login.
"""

which for serial-getty@hvc0.service will make the service manager read
all credentials of the form "tty.serial.hvc0.agetty.xxx" and pass them
to the service in the form "agetty.xxx" (same goes for login). We can
apply the same to each of the getty units to allow setting agetty and
login credentials for individual ttys instead of globally.
2024-07-31 15:52:27 +02:00
Mike Yuan
9d50d053f3
core: expose PrivateTmp=disconnected
As discussed in https://github.com/systemd/systemd/pull/32724#discussion_r1638963071

I don't find the opposite reasoning particularly convincing.
We have ProtectHome=tmpfs and friends, and those can be
pretty much trivially implemented through TemporaryFileSystem=
too. The new logic brings many benefits, and is completely generic,
hence I see no reason not to expose it. We can even get more tests
for the code path if we make it public.
2024-06-21 17:31:44 +02:00
Mike Yuan
c3662116b9
man/org.freedesktop.systemd1: Status{Bus,Varlink}Error belongs to Service, not Scope
Follow-up for 9c025022d9

Ugh, shouldn't have done this bit when I was sleepy...
2024-06-21 16:47:28 +02:00
Mike Yuan
9c025022d9
core/service: store BUSERROR= & VARLINKERROR= received through notification
Closes #6073
2024-06-20 19:03:44 +02:00
Zbigniew Jędrzejewski-Szmek
75ced6d5ee various: update links to usr-merge 2024-05-28 14:48:56 +02:00
Zbigniew Jędrzejewski-Szmek
f81af0b082 man: update links to "New Control Group Interfaces" 2024-05-28 14:46:44 +02:00
Lennart Poettering
3c1d1ca146 manager: switch service unit type over to using new handoff timestamping logic
Also: rename Handover → Handoff. I think it makes it clearer that this
is not really about handing over any resources, but that the executor is
out off the game from that point on.
2024-04-25 13:40:41 +02:00
Luca Boccassi
c75c8a38b8 man: document service types that record ExecMainHandoverTimestamp
Follow-up for 93cb78aee2
2024-04-24 07:55:37 +02:00
Mike Yuan
844863c61e
core/manager: add unmerged-bin taint 2024-04-24 08:43:08 +08:00
Mike Yuan
ea81442892
core/manager: rearrange taint tags 2024-04-24 08:40:25 +08:00
Mike Yuan
2b28dfe6e6
core/manager: drop obsolete cgroup taint string
Wwe can't boot on systems without cgroup anyway
(even cgroup v1 will be gone pretty soon).
2024-04-24 08:39:29 +08:00
Luca Boccassi
93cb78aee2 core: add ExecMainHandoverTimestamp property recording time-of-execve
Enable the exec_fd logic for Type=notify* services too, and change it
to send a timestamp instead of a '1' byte. Record the timestamp in a
new ExecMainHandoverTimestamp property so that users can track accurately
when control is handed over from systemd to the service payload, so
that latency and startup performance can be trivially and accurately
tracked and attributed.
2024-04-22 15:16:05 +02:00
Luca Boccassi
ef5f7f9437 systemctl: add --clean= values to documentation and shell completion 2024-04-18 14:07:07 +02:00
Luca Boccassi
b3f548615f core: rename SoftRebootStartTimestamp -> ShutdownStartTimestamp and generalize
Follow-up for 54f86b86ba
2024-04-17 18:19:27 +01:00
Luca Boccassi
95a289bfe7 man: mention initial value of SoftRebootsCount
Follow-up for 66f35161f6
2024-04-16 00:26:04 +01:00
Luca Boccassi
66f35161f6 core: add counter for soft-reboot iterations
Allow to query via D-Bus how many times the current booted system has
been soft rebooted
2024-03-27 01:27:35 +00:00
Luca Boccassi
54f86b86ba core: add SoftRebootStartTimestamp
Will be useful to calculate how long it took to shut down the system before starting
in the new root
2024-03-27 01:25:49 +00:00
Jakub Sitnicki
97df75d7bd socket: pass socket FDs to all ExecXYZ= commands but ExecStartPre=
Today listen file descriptors created by socket unit don't get passed to
commands in Exec{Start,Stop}{Pre,Post}= socket options.

This prevents ExecXYZ= commands from accessing the created socket FDs to do
any kind of system setup which involves the socket but is not covered by
existing socket unit options.

One concrete example is to insert a socket FD into a BPF map capable of
holding socket references, such as BPF sockmap/sockhash [1] or
reuseport_sockarray [2]. Or, similarly, send the file descriptor with
SCM_RIGHTS to another process, which has access to a BPF map for storing
sockets.

To unblock this use case, pass ListenXYZ= file descriptors to ExecXYZ=
commands as listen FDs [4]. As an exception, ExecStartPre= command does not
inherit any file descriptors because it gets invoked before the listen FDs
are created.

This new behavior can potentially break existing configurations. Commands
invoked from ExecXYZ= might not expect to inherit file descriptors through
sd_listen_fds protocol.

To prevent breakage, add a new socket unit parameter,
PassFileDescriptorsToExec=, to control whether ExecXYZ= programs inherit
listen FDs.

[1] https://docs.kernel.org/bpf/map_sockmap.html
[2] https://lore.kernel.org/r/20180808075917.3009181-1-kafai@fb.com
[3] https://man.archlinux.org/man/socket.7#SO_INCOMING_CPU
[4] https://www.freedesktop.org/software/systemd/man/latest/sd_listen_fds.html
2024-03-27 01:41:26 +08:00
Mike Yuan
1ea275f119 core/cgroup: introduce MemoryZSwapWriteback setting
Added in
501a06fe8e
2024-03-13 23:36:25 +00:00
Frantisek Sumsal
43b238f1c1 man: suffix signals with ()
Since signals can take arguments, let's suffix them with () as we
already do with functions. To make sure we remain consistent, make the
`update-dbus-docs.py` script check & fix any occurrences where this is
not the case.

Resolves: #31002
2024-01-23 16:27:50 +01:00
Luca Boccassi
d156e66f82 man: add more suggestions on how to use StartUnit and JobRemoved
This is not immediately clear for users, so spell out the preferred pattern
clearly in the D-Bus documentation.
2024-01-18 17:22:12 +00:00
Lennart Poettering
658dc909dc man: fix references to systemd.exec(5)
For some reason the section for the systemd.exec man page was added
incorrectly and then copypasted everywhere else incorrectly too. Let's
fix that.
2024-01-11 12:19:44 +00:00
Yu Watanabe
82a1597778
Merge pull request #28797 from Werkov/eff_limits
Add MemoryMaxEffective=, MemoryHighEffective= and TasksMaxEff…  …ective= properties
2024-01-04 05:38:06 +09:00
Michal Sekletar
84c01612de core/manager: add dbus API to create auxiliary scope from running service
This commit introduces new D-Bus API, StartAuxiliaryScope(). It may be
used by services as part of the restart procedure. Service sends an
array of PID file descriptors corresponding to processes that are part
of the service and must continue running also after service restarts,
i.e. they haven't finished the job why they were spawned in the first
place (e.g. long running video transcoding job). Systemd creates new
scope unit for these processes and migrates them into it. Cgroup
properties of scope are copied from the service so it retains same
cgroup settings and limits as service had.
2024-01-03 13:50:41 +01:00
Michal Koutný
4fb0d2dc14 cgroup: Add EffectiveMemoryMax=, EffectiveMemoryHigh= and EffectiveTasksMax= properties
Users become perplexed when they run their workload in a unit with no
explicit limits configured (moreover, listing the limit property would
even show it's infinity) but they experience unexpected resource
limitation.

The memory and pid limits come as the most visible, therefore add new
unit read-only properties:
- EffectiveMemoryMax=,
- EffectiveMemoryHigh=,
- EffectiveTasksMax=.

These properties represent the most stringent limit systemd is aware of
for the given unit -- and that is typically(*) the effective value.

Implement the properties by simply traversing all parents in the
leaf-slice tree and picking the minimum value. Note that effective
limits are thus defined even for units that don't enable explicit
accounting (because of the hierarchy).

(*) The evasive case is when systemd runs in a cgroupns and cannot
reason about outer setup. Complete solution would need kernel support.
2024-01-03 13:37:08 +01:00
David Tardon
eea10b26f7 man: use same version in public and system ident. 2023-12-25 15:51:47 +01:00
Andrew Sayers
ff47602f5e Fix a typo in the org.freedesktop.systemd1 man page 2023-12-15 07:39:05 +09:00
Luca Boccassi
9e615fa3aa core: add WantsMountsFor=
This is the equivalent of RequiresMountsFor=, but adds Wants= instead
of Requires=. It will be useful for example for the autogenerated
systemd-cryptsetup units.

Fixes https://github.com/systemd/systemd/issues/11646
2023-11-29 11:04:59 +00:00