These operations might require slow I/O, and thus might block PID1's main
loop for an indeterminate amount of time. Instead of performing them
inline, fork a worker process, stash away the D-Bus message, and reply
once we get a SIGCHLD indicating they have completed. That way we don't
break compatibility and callers can continue to rely on the fact that when
they get the method reply the operation either succeeded or failed.
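The pattern described above can be sketched roughly as follows. This is a minimal illustration in Python, not the PID1 implementation: the names `run_in_worker`, `reap_and_reply` and the `pending` dictionary are hypothetical, and a real service manager would call the reaping step from its SIGCHLD handling rather than from a loop.

```python
import os

def run_in_worker(operation, pending, message_id):
    """Fork a worker for a slow operation and stash the request to answer later."""
    pid = os.fork()
    if pid == 0:
        # Child: do the slow work and report the outcome via the exit status only.
        try:
            operation()
            os._exit(0)
        except Exception:
            os._exit(1)
    # Parent: remember which message belongs to this worker, then return at once.
    pending[pid] = message_id
    return pid

def reap_and_reply(pending, reply):
    """Answer stashed requests whose worker has exited (e.g. after SIGCHLD)."""
    for pid in list(pending):
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == 0:
            continue  # this worker is still running
        message_id = pending.pop(pid)
        # Exit code 0 means success; anything else (including a signal) is failure.
        reply(message_id, os.waitstatus_to_exitcode(status) == 0)
```

The caller still sees a single reply that means "succeeded" or "failed", but the main loop never waits on the slow operation itself.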
To keep backward compatibility, unlike reload control processes, these
are run inside init.scope and not the target cgroup. Unlike ExecReload,
this is under our control and is not defined by the unit. This is necessary
because previously the operation also wasn't run from the target cgroup,
so suddenly forking a copy-on-write copy of pid1 into the target cgroup
will make memory usage spike, and if there is a MemoryMax= or MemoryHigh=
set and the cgroup is already close to the limit, it will cause an OOM
kill, where previously it would have worked fine.
One of the major pain points of managing fleets of headless nodes is
that when something fails at startup, unless debug level was already
enabled (which it usually isn't, as it's a firehose), one needs to manually
enable it and pray the issue can be reproduced, which is often really
hard and time consuming, just to get extra info. Usually the extra log
messages are enough to triage an issue.
This new option makes it so that when a service fails and is restarted
due to Restart=, log level for that unit is set to debug, so that all
setup code in pid1 and sd-executor logs at debug level, and also a new
DEBUG_INVOCATION=1 env var is passed to the service itself, so that it
knows it should start with a higher log level. Once the unit succeeds
or reaches the rate limit the original level is restored.
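On the service side, honouring the variable could look like the sketch below. Only the `DEBUG_INVOCATION=1` variable comes from the description above; the helper name `pick_log_level` and the choice of INFO as the normal level are assumptions made for illustration.

```python
import logging
import os

def pick_log_level(environ=os.environ, default=logging.INFO):
    """Return DEBUG when the service manager asked for a debug invocation."""
    if environ.get("DEBUG_INVOCATION") == "1":
        return logging.DEBUG
    return default

# Configure logging once at startup, based on how we were invoked.
logging.basicConfig(level=pick_log_level())
```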
This allows for "per-instance" credentials for units. The use case
is best explained with an example. Currently all our getty units
have the following stanzas in their unit file:
"""
ImportCredential=agetty.*
ImportCredential=login.*
"""
This means that setting agetty.autologin=root as a system credential
will make every instance of all our getty units autologin as the
root user. This prevents us from doing autologin on /dev/hvc0 while
still requiring manual login on all other ttys.
To solve the issue, we introduce support for renaming credentials with
ImportCredential=. This will allow us to add the following to e.g.
serial-getty@.service:
"""
ImportCredential=tty.serial.%I.agetty.*:agetty.
ImportCredential=tty.serial.%I.login.*:login.
"""
which for serial-getty@hvc0.service will make the service manager read
all credentials of the form "tty.serial.hvc0.agetty.xxx" and pass them
to the service in the form "agetty.xxx" (same goes for login). We can
apply the same to each of the getty units to allow setting agetty and
login credentials for individual ttys instead of globally.
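The renaming rule can be illustrated with a small Python sketch. It is not the service manager's code: the function `import_credentials` is hypothetical, and it assumes the only wildcard is a trailing `*` and that `%I` has a single, already unescaped instance value.

```python
def import_credentials(available, pattern, rename_prefix, instance):
    """Select credentials matching `pattern` and rename them under `rename_prefix`."""
    pattern = pattern.replace("%I", instance)
    if not pattern.endswith("*"):
        raise ValueError("this sketch only handles a trailing '*' wildcard")
    literal = pattern[:-1]  # e.g. "tty.serial.hvc0.agetty."
    imported = {}
    for name, value in available.items():
        if name.startswith(literal):
            # Keep the part matched by '*' and swap the prefix.
            imported[rename_prefix + name[len(literal):]] = value
    return imported

system_credentials = {
    "tty.serial.hvc0.agetty.autologin": "root",
    "tty.serial.ttyS0.agetty.autologin": "operator",
}
hvc0 = import_credentials(system_credentials, "tty.serial.%I.agetty.*", "agetty.", "hvc0")
```

Here `hvc0` only contains the credential meant for that tty, exposed to the service under the name `agetty.autologin`.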
As discussed in https://github.com/systemd/systemd/pull/32724#discussion_r1638963071
I don't find the opposite reasoning particularly convincing.
We have ProtectHome=tmpfs and friends, and those can be
pretty much trivially implemented through TemporaryFileSystem=
too. The new logic brings many benefits, and is completely generic,
hence I see no reason not to expose it. We can even get more tests
for the code path if we make it public.
Also: rename Handover → Handoff. I think it makes it clearer that this
is not really about handing over any resources, but that the executor is
out of the game from that point on.
Enable the exec_fd logic for Type=notify* services too, and change it
to send a timestamp instead of a '1' byte. Record the timestamp in a
new ExecMainHandoverTimestamp property so that users can track accurately
when control is handed over from systemd to the service payload, so
that latency and startup performance can be trivially and accurately
tracked and attributed.
Today, listen file descriptors created by a socket unit don't get passed to
commands in Exec{Start,Stop}{Pre,Post}= socket options.
This prevents ExecXYZ= commands from accessing the created socket FDs to do
any kind of system setup which involves the socket but is not covered by
existing socket unit options.
One concrete example is to insert a socket FD into a BPF map capable of
holding socket references, such as BPF sockmap/sockhash [1] or
reuseport_sockarray [2]. Or, similarly, send the file descriptor with
SCM_RIGHTS to another process, which has access to a BPF map for storing
sockets.
To unblock this use case, pass ListenXYZ= file descriptors to ExecXYZ=
commands as listen FDs [4]. As an exception, the ExecStartPre= command does not
inherit any file descriptors because it gets invoked before the listen FDs
are created.
This new behavior can potentially break existing configurations. Commands
invoked from ExecXYZ= might not expect to inherit file descriptors through
the sd_listen_fds protocol.
To prevent breakage, add a new socket unit parameter,
PassFileDescriptorsToExec=, to control whether ExecXYZ= programs inherit
listen FDs.
[1] https://docs.kernel.org/bpf/map_sockmap.html
[2] https://lore.kernel.org/r/20180808075917.3009181-1-kafai@fb.com
[3] https://man.archlinux.org/man/socket.7#SO_INCOMING_CPU
[4] https://www.freedesktop.org/software/systemd/man/latest/sd_listen_fds.html
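For reference, a command invoked this way could discover the inherited sockets roughly as follows. This is a simplified Python sketch of the sd_listen_fds protocol from [4] (the `LISTEN_PID` and `LISTEN_FDS` environment variables, with descriptors starting at 3); the function name `listen_fds` and its injectable parameters are only there for illustration.

```python
import os

SD_LISTEN_FDS_START = 3  # first passed descriptor, after stdin/stdout/stderr

def listen_fds(environ=os.environ, pid=None):
    """Return the list of inherited listen FDs, or [] if none were passed to us."""
    pid = os.getpid() if pid is None else pid
    try:
        # The FDs are only meant for the process named in LISTEN_PID.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables missing or malformed
    if count < 0:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))
```

A command that does not expect this protocol simply ignores the variables, but it still inherits the open descriptors, which is why the behavior is opt-in.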
Since signals can take arguments, let's suffix them with () as we
already do with functions. To make sure we remain consistent, make the
`update-dbus-docs.py` script check & fix any occurrences where this is
not the case.
Resolves: #31002
This commit introduces a new D-Bus API, StartAuxiliaryScope(). It may be
used by services as part of the restart procedure. The service sends an
array of PID file descriptors corresponding to processes that are part
of the service and must continue running after the service restarts,
i.e. they haven't finished the job for which they were spawned in the first
place (e.g. a long-running video transcoding job). systemd creates a new
scope unit for these processes and migrates them into it. The cgroup
properties of the scope are copied from the service, so it retains the same
cgroup settings and limits as the service had.
Users become perplexed when they run their workload in a unit with no
explicit limits configured (moreover, listing the limit property would
even show it's infinity) but they experience unexpected resource
limitation.
The memory and PID limits are the most visible, therefore add new
read-only unit properties:
- EffectiveMemoryMax=,
- EffectiveMemoryHigh=,
- EffectiveTasksMax=.
These properties represent the most stringent limit systemd is aware of
for the given unit -- and that is typically(*) the effective value.
Implement the properties by simply traversing all parents in the
leaf-slice tree and picking the minimum value. Note that effective
limits are thus defined even for units that don't enable explicit
accounting (because of the hierarchy).
(*) The evasive case is when systemd runs in a cgroupns and cannot
reason about the outer setup. A complete solution would need kernel support.
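The traversal can be sketched in a few lines of Python. The data structures are hypothetical stand-ins: `parent_of` maps a unit to its parent slice and `configured` holds only explicitly set limits, with a missing entry meaning infinity.

```python
import math

def effective_limit(unit, parent_of, configured):
    """Return the most stringent limit among a unit and all of its ancestors."""
    limit = math.inf
    while unit is not None:
        limit = min(limit, configured.get(unit, math.inf))
        unit = parent_of.get(unit)  # None once we walk past the root slice
    return limit

parent_of = {"app.service": "apps.slice", "apps.slice": "-.slice"}
configured = {"apps.slice": 512 * 1024 * 1024}  # only the slice has a limit

limit = effective_limit("app.service", parent_of, configured)
```

In this example the service has no limit of its own, yet its effective limit is the 512 MiB inherited from the parent slice.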
This is the equivalent of RequiresMountsFor=, but adds Wants= instead
of Requires=. It will be useful for example for the autogenerated
systemd-cryptsetup units.
Fixes https://github.com/systemd/systemd/issues/11646
In systemctl-show we only show current swap if ever swapped or non-zero. This
reduces the noise on swapless systems, that would otherwise always show a swap
value that never has the chance to become non-zero. It further reduces the
noise for services that never swapped.
Linux's Control Group v2 interface exposes memory.peak, which contains the
"max memory usage recorded for the cgroup and its descendants since the
creation of the cgroup."
This commit adds a new property "MemoryPeak" for units and makes "systemctl
show" display this value if it is available.
Fixes #29878.
Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Instead of mounting over, do an atomic swap using mount beneath, if
available. This way assets can be mounted again and again (e.g.:
updates) without leaking mounts.
Before this commit, $USER, $HOME, $LOGNAME and $SHELL are only
set when User= is set for the unit. For system services, this
results in different behaviors depending on whether User=root is set.
$USER always makes sense on its own, so let's set it unconditionally.
Ideally $HOME should be set too, but it causes trouble when e.g. getty
passes '-p' to login(1), which then doesn't override $HOME. $LOGNAME and
$SHELL are more like "login environments", and are generally not
suitable for system services. Therefore, a new option SetLoginEnvironment=
is also added to control the latter three variables.
Fixes #23438
Replaces #8227
Add a new boolean for units, SurviveFinalKillSignal=yes/no. Units that
set it will not have their processes receive the final SIGTERM/SIGKILL in
the shutdown phase.
This is implemented by checking if a process is part of a cgroup marked
with a user.survive_final_kill_signal xattr (or a trusted xattr if we
can't set a user one, as user xattrs were added only in kernel v5.7 and
are not supported in CentOS 8).
New directive `NFTSet=` provides a method for integrating dynamic cgroup IDs
into firewall rules with NFT sets. The benefit of using this setting is being
able to use a control group as a selector in firewall rules easily, which in
turn allows more fine-grained filtering. Also, NFT rules for cgroup matching
use numeric cgroup IDs, which change every time a service is restarted, making
them hard to use in a systemd environment.
This option expects a whitespace separated list of NFT set definitions. Each
definition consists of a colon-separated tuple of source type (only "cgroup"),
NFT address family (one of "arp", "bridge", "inet", "ip", "ip6", or "netdev"),
table name and set name. The names of tables and sets must conform to lexical
restrictions of NFT table names. The type of the element used in the NFT filter
must be "cgroupsv2". When a control group for a unit is realized, the cgroup ID
will be appended to the NFT sets and it will be removed when the control
group is removed. systemd only inserts elements to (or removes from) the sets,
so the related NFT rules, tables and sets must be prepared elsewhere in
advance. Failures to manage the sets will be ignored.
If the firewall rules are reinstalled so that the contents of NFT sets are
destroyed, the command `systemctl daemon-reload` can be used to refill the sets.
Example:
```
table inet filter {
...
set timesyncd {
type cgroupsv2
}
chain ntp_output {
socket cgroupv2 != @timesyncd counter drop
accept
}
...
}
```
/etc/systemd/system/systemd-timesyncd.service.d/override.conf
```
[Service]
NFTSet=cgroup:inet:filter:timesyncd
```
```
$ sudo nft list set inet filter timesyncd
table inet filter {
set timesyncd {
type cgroupsv2
elements = { "system.slice/systemd-timesyncd.service" }
}
}
```
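The syntax of the setting can be illustrated with a small parser. This is a Python sketch of the format described above, not systemd's parser: the names `NFTSetEntry` and `parse_nft_set` are hypothetical, and the lexical checks on table and set names are left out.

```python
from typing import NamedTuple

NFT_FAMILIES = {"arp", "bridge", "inet", "ip", "ip6", "netdev"}

class NFTSetEntry(NamedTuple):
    source: str
    family: str
    table: str
    set_name: str

def parse_nft_set(value):
    """Parse a whitespace-separated list of source:family:table:set tuples."""
    entries = []
    for item in value.split():
        parts = item.split(":")
        if len(parts) != 4:
            raise ValueError(f"expected four colon-separated fields: {item!r}")
        source, family, table, set_name = parts
        if source != "cgroup":
            raise ValueError(f"unsupported source type: {source!r}")
        if family not in NFT_FAMILIES:
            raise ValueError(f"unknown NFT address family: {family!r}")
        entries.append(NFTSetEntry(source, family, table, set_name))
    return entries

entries = parse_nft_set("cgroup:inet:filter:timesyncd")
```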
This adds a new "PollLimit" pair of settings to .socket units, very
similar to existing "TriggerLimit" logic. The differences are:
* PollLimit focuses on the polling on the sockets, and pauses that
temporarily if a ratelimit on that is reached. TriggerLimit, on the other
hand, focuses on the triggering effect of socket units, and stops
triggering once the ratelimit is hit.
* While the trigger limit being hit is an action that causes the socket
unit to fail, the polling limit being reached will just temporarily
disable polling on the socket fd, and it is resumed once the ratelimit
interval is over.
* When a socket unit operates on multiple socket fds (e.g. ListenStream=
on both an IPv6 and an IPv4 address), the PollLimit will
be specific to each fd, while the trigger limit is specific to the
whole unit.
Implementation-wise this is mostly a wrapper around sd-event's
sd_event_source_set_ratelimit(), which exposes the desired behaviour
directly.
Use case for all of this: socket services which, when overloaded with
connections, should just slow down reception of them, but not fail
persistently.
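The "pause, then resume" behaviour can be sketched with a simple fixed-window rate limiter. This is an illustration in Python only, not sd-event's implementation: the class `PollRateLimit` is hypothetical, and one instance would be kept per socket fd to mirror the per-fd scope described above.

```python
class PollRateLimit:
    """Allow at most `burst` poll events per `interval` seconds."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Return True if polling may proceed at time `now`, False if paused."""
        if self.window_start is None or now - self.window_start >= self.interval:
            # A new interval begins: polling resumes with a fresh budget.
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # limit hit: skip polling until the interval is over

limiter = PollRateLimit(interval=2.0, burst=2)
decisions = [limiter.allow(t) for t in (0.0, 0.5, 1.0, 2.0)]
```

Unlike a trigger limit, hitting this limit never puts anything into a failed state. The caller just stops polling that fd until `allow()` returns True again.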