Let's make sure we collect the right error code from errno; otherwise
we'll see EPERM (i.e. error 1) for all errors readv() returns (since it
returns -1 on error), including EAGAIN.
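The mix-up can be sketched outside of C as well. The following is only an illustration, not the patched systemd code: the helper name read_error_codes() is hypothetical, and the C return value of -1 is modelled by hand, since Python reports the failure as an exception carrying errno.

```python
import errno
import os

def read_error_codes():
    """Return (wrong_code, right_code) for a failed non-blocking readv()."""
    r, w = os.pipe()
    try:
        os.set_blocking(r, False)
        try:
            # The pipe is empty, so the underlying readv() call fails.
            os.readv(r, [bytearray(16)])
        except OSError as e:
            c_return_value = -1       # what C readv() returns on any error
            wrong = -c_return_value   # negating the return value: always 1
            right = e.errno           # the code actually stored in errno
            return wrong, right
        raise RuntimeError("readv unexpectedly succeeded")
    finally:
        os.close(r)
        os.close(w)
```

Negating the return value always yields 1, which happens to be EPERM, while errno carries the real reason (EAGAIN in this sketch).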
This is definitely backport material.
A fix-up for 3691bcf3c5eebdcca5b4f1c51c745441c57a6cd1.
Fixes: #16699
The concept is flawed, and mostly useless. Let's finally remove it.
It has been deprecated since 90a2ec10f2d43a8530aae856013518eb567c4039 (6
years ago) and we started to warn since
55dadc5c57ef1379dbc984938d124508a454be55 (1.5 years ago).
Let's get rid of it altogether.
Previously, we'd create them from user-runtime-dir@.service. That has
one benefit: since this service runs privileged, we can create the full
set of device nodes. It has one major drawback though: it is problematic
security-wise to create files/directories as a privileged user in
directories owned by unprivileged users, since they can use symlinks to
redirect what we want to do. As a general rule we hence avoid this
logic: only unpriv code should populate unpriv directories.
Hence, let's move this code to an appropriate place in the service
manager. This means we lose the inaccessible block device node, but
since there's already a fallback in place, this shouldn't be too bad.
This has the major benefit that the entire payload of the container can
access these files there. Previously, we'd set them only as env vars,
but that meant only PID 1 could read them directly or other privileged
payload code with access to /run/1/environ.
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.
In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.
With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.
In order to make sure systemd then uses the /run/host/inaccessible/
nodes, this commit changes PID 1 to look for that dir and, if it exists,
symlink it to /run/systemd/inaccessible.
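The lookup-and-symlink step could look roughly like this. It is a sketch under assumptions, not the actual PID 1 code: the function name link_inaccessible() and the root parameter (used so it can be tried in a scratch directory) are hypothetical.

```python
import os
import tempfile

def link_inaccessible(root="/"):
    """Return True if the host-provided dir exists, linking it into
    run/systemd/ when that path is still free; return False otherwise."""
    host_dir = os.path.join(root, "run/host/inaccessible")
    systemd_dir = os.path.join(root, "run/systemd/inaccessible")
    if not os.path.isdir(host_dir):
        # Nothing was passed in: the caller would fall back to creating
        # the nodes itself.
        return False
    os.makedirs(os.path.dirname(systemd_dir), exist_ok=True)
    if not os.path.lexists(systemd_dir):
        os.symlink(host_dir, systemd_dir)
    return True

if __name__ == "__main__":
    # Try it in a throwaway directory instead of the real /run.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "run/host/inaccessible"))
        print(link_inaccessible(root))
```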
Now, this will work fine if new nspawn and new PID 1 in the container
work together, as then the symlink is created and the difference between
the two dirs won't matter.
For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.
For the case where a different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. This is fine
though as there are fallbacks in place for that case.
For the case where a new nspawn invokes an old PID 1: this is where the
(minor) incompatibility happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systemd/inaccessible/, PID 1 will try to create them
there as if a different container manager had set things up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
The sd_notify() socket that nspawn binds, and that the payload can use
to talk to it, was previously stored in /run/systemd/nspawn/notify. That
is weird (as in the previous commit), since it makes /run/systemd
something that is cooperatively maintained by systemd inside the
container and nspawn outside of it.
We now have a better place where container managers can put the stuff
they want to pass to the payload: /run/host/, hence let's make use of
that.
This is not a compat breakage, since the sd_notify() protocol is based
on the $NOTIFY_SOCKET env var, where we place the new socket path.
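Why the path change is invisible to clients can be seen from a minimal client sketch: it never hardcodes a location and only follows $NOTIFY_SOCKET. This is an illustrative reimplementation, not libsystemd's sd_notify(); the env parameter is an assumption added to make it easy to try.

```python
import os
import socket

def sd_notify(state, env=os.environ):
    """Send a notification datagram to the socket named by $NOTIFY_SOCKET.

    Returns False if the variable is unset, True once the datagram is sent.
    """
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.sendto(state.encode(), path)
    return True
```

A payload would call sd_notify("READY=1") and reach the manager wherever it chose to bind the socket.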
Previously we'd use a directory /run/systemd/nspawn/incoming for
accepting mounts to propagate from the host. This is a bit weird, since
we have a shared namespace: /run/systemd/ contains both stuff managed by
the surrounding nspawn as well as by the systemd inside.
We now have the /run/host/ hierarchy that has special stuff we want to
pass from host to container. Let's make use of that here, and move this
directory here too.
This is not a compat breakage, since the payload never interfaces with
that directory natively: it's only nspawn and machined that need to
agree on it.
arg_system == true and getpid() == 1 hold under the very same condition
this early in the main() function (this only changes later when we start
parsing command lines, where arg_system = true is set if users invoke us
in test mode even when getpid() != 1).
Hence, let's simplify things, and merge a couple of if branches and not
pretend they were orthogonal.
Timestamps for unit start/stop are recorded with microsecond granularity,
but status and show truncate to second granularity by default.
Add a --timestamp=pretty|us|utc option to allow including the microseconds
or to use the UTC TZ for all timestamps printed by systemctl.
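The difference between the three styles can be sketched as follows. The function name format_timestamp() is hypothetical and the exact output layout is an assumption; this is not systemctl's formatting code.

```python
from datetime import datetime, timezone

def format_timestamp(usec, style="pretty"):
    """Format microseconds since the epoch as 'pretty', 'us' or 'utc'."""
    seconds, micros = divmod(usec, 1_000_000)
    utc = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)
    if style == "utc":
        # Second granularity, always in UTC.
        return utc.strftime("%a %Y-%m-%d %H:%M:%S UTC")
    local = utc.astimezone()  # convert to the system's local time zone
    if style == "us":
        # Local time zone, with the microseconds kept.
        return local.strftime("%a %Y-%m-%d %H:%M:%S.%f %Z")
    if style == "pretty":
        # Local time zone, truncated to seconds (the default).
        return local.strftime("%a %Y-%m-%d %H:%M:%S %Z")
    raise ValueError(f"unknown timestamp style: {style}")
```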
Apparently both Fedora and SUSE default to btrfs now, it should hence be
good enough for us too.
This enables a bunch of really nice things for us, most importantly we
can resize home directories freely (i.e. both grow *and* shrink) while
online. It also allows us to add nice subvolume based home directory
snapshotting later on.
Also, whenever we mention the three supported types, always mention
them in alphabetical order, which is also our new order of preference.
Some masks shouldn't be needed externally, so keep their functions
private to the module (others would fit there too, but they're used in
tests), so that we think twice before making anything depend on them.
Drop unused function cg_attach_many_everywhere.
Use cgroup_realized instead of cgroup_path when we actually ask for
realized.
This should not cause any functional changes.
The usage in unit_get_own_mask is redundant, we only need to apply
disable_mask at the end before application, i.e. when calculating the
enable or target mask.
(IOW, we allow all configurations, but disabling affects effective
controls.)
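As a rough illustration of applying the disable mask last: the controller bit values and function names below are hypothetical and much simpler than the real mask computation.

```python
# Hypothetical controller bits, for illustration only.
CPU, IO, MEMORY, PIDS = 1, 2, 4, 8

def own_mask(requested):
    """Controllers the unit's own configuration asks for (nothing disabled here)."""
    return requested

def target_mask(own, members, disabled):
    """Controllers to actually realize: the disable mask is applied last."""
    return (own | members) & ~disabled
```

With own = CPU | MEMORY, members = IO and disabled = MEMORY, the own mask still reports MEMORY as configured, while the target mask is CPU | IO.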
Modify tests accordingly and add testing of enable mask.
This is intended as cleanup, with no effect but changing unit_dump
output.
The unit_add_siblings_to_cgroup_realize_queue does more than mere
siblings queueing, hence define a family of a unit as (immediate)
children of the unit and immediate children of all ancestors.
Working with this abstraction simplifies the queuing calls and it
shouldn't change the functionality.
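The family definition can be sketched on a toy tree. The dict-based representation and unit names are assumptions for illustration; the real code walks slice units.

```python
def family(unit, parent, children):
    """Immediate children of the unit and of each of its ancestors.

    parent maps a unit to its parent (missing for the root),
    children maps a unit to the list of its immediate children.
    """
    result = []
    node = unit
    while node is not None:
        result.extend(children.get(node, []))
        node = parent.get(node)
    return result
```

Note that the unit itself and its siblings are part of the family, since they are immediate children of the unit's parent.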
Merge members mask invalidation into
unit_add_siblings_to_cgroup_realize_queue, this way unit_realize_cgroup
needn't be called with members mask invalidation.
We have to retain the members mask invalidation in unit_load -- although
active units would have cgroups (re)realized (unit_load queues for
realization), the realization would happen with potentially stale mask.
unit_free(u) realizes direct parent and invalidates members mask of all
ancestors. This isn't sufficient in v1 controller hierarchies since
siblings of the freed unit may have existed only because of the removed
unit.
We cannot be lazy about the siblings because if parent(u) is also
removed, it'd migrate and rmdir cgroups for siblings(u). However,
realized masks of siblings(u) won't reflect this change.
This was a non-issue earlier because we weren't removing cgroup
directories properly (effectively matching the stale realized mask):
removal failed because of tasks left behind by the missing migration
(see previous commit).
Therefore, ensure realization of all units necessary to clean up after
the free'd unit.
Fixes: #14149
When we are about to derealize a controller on a v1 cgroup, we first
attempt to delete the controller cgroup and migrate afterwards. This
doesn't work in practice because a populated cgroup cannot be deleted.
Furthermore, we leave out slices from migration completely, so
(un)setting a control value on them won't realize their controller
cgroup.
Rework the actual realization: unit_create_cgroup() becomes
unit_update_cgroup(), and we make sure that controller hierarchies are
reduced when a given controller cgroup ceases to be needed.
Note that with this we introduce slight deviation between v1 and v2 code
-- when a descendant unit turns off a delegated controller, we attempt
to disable it in ancestor slices. On v2 this may fail (kernel enforced,
because of child cgroups using the controller), on v1 we'll migrate
whole subtree and trim the subhierarchy. (Previously, we wouldn't take
away delegated controller, however, derealization was broken anyway.)
Fixes: #14149
The description didn't really explain how the distribution mechanism
works exactly and the relationship of leaf and slice units.
Update the documentation and also explicitly explain the expected
behaviour as it is created by the memory_recursiveprot cgroup2 mount
option.
When available, enable memory_recursiveprot. Realistically it always
makes sense to delegate MemoryLow= and MemoryMin= to all children of a
slice/unit.
The kernel option is not enabled by default as it might cause
regressions in some setups. However, it is the better default in
general, and it results in a more flexible and obvious behaviour.
The alternative to using this option would be for users to also set
DefaultMemoryLow= on slices when assigning MemoryLow=. However, this
makes the effect of MemoryLow= on some children less obvious, as it
could result in a lower protection rather than increasing it.
From the kernel documentation:
  memory_recursiveprot
        Recursively apply memory.min and memory.low protection to
        entire subtrees, without requiring explicit downward
        propagation into leaf cgroups. This allows protecting entire
        subtrees from one another, while retaining free competition
        within those subtrees. This should have been the default
        behavior but is a mount-option to avoid regressing setups
        relying on the original semantics (e.g. specifying bogusly
        high 'bypass' protection values at higher tree levels).
This was added in kernel commit 8a931f801340c (mm: memcontrol:
recursive memory.low protection), which became available in 5.7 and was
subsequently fixed in kernel 5.7.7 (mm: memcontrol: handle div0 crash
race condition in memory.low).
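Whether the option is in effect can be checked from the cgroup2 mount's options. The parser below is a sketch, with a hypothetical function name, that works on mountinfo-style text rather than on systemd's own mount logic.

```python
def cgroup2_has_recursiveprot(mountinfo_text):
    """Return True if the first cgroup2 mount lists memory_recursiveprot."""
    for line in mountinfo_text.splitlines():
        # In /proc/<pid>/mountinfo, " - " separates the per-mount fields
        # from "fstype source super-options".
        _, sep, right = line.partition(" - ")
        if not sep:
            continue
        fields = right.split()
        if len(fields) >= 3 and fields[0] == "cgroup2":
            return "memory_recursiveprot" in fields[2].split(",")
    return False
```

On a live system one would pass it the contents of /proc/self/mountinfo.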
It is possible that we will be running with an upgraded libseccomp, in which
case libseccomp might know the syscall name, even if the number is not known at
the time when systemd is being compiled. The guard only serves to break such
upgrades, by requiring that we also recompile systemd.
For s390-specific syscalls, use a define to exclude them, so that we don't
try to filter them on other arches.
(cherry picked from commit 6cf852e79eb0eced2f77653941f9c75c3bd79386)
I was trying to run src/partition/test-repart.sh on CentOS 8 and the first
resize call kept failing with ERANGE. It turned out that CentOS 8 comes
with libfdisk-devel-2.32.1, which is missing 2f35c1ead6 (in libfdisk 2.33
and up).