IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
We nowadays support unprivileged invocation of systemd-nspawn +
systemd-vmspawn, but there was no support for discovering suitable disk
images (i.e. no per-user counterpart of /var/lib/machines). Add this
now, and hook it up everywhere.
Instead of hardcoding machined's, importd's, portabled's, sysupdated's
image discovery to RUNTIME_SCOPE_SYSTEM I introduced a field that make
the scope variable, even if this field is always initialized to
RUNTIME_SCOPE_SYSTEM for now. I think these four services should
eventually be updated to support a per-user concept too, this is
preparation for that, even though it doesn't outright add support for
this.
This is for the largest part not user visible, except for in nspawn,
vmspawn and the dissect tool. For the latter I added a pair of
--user/--system switches to select the discovery scope.
Introduce the `systemd.break=` kernel command line option to allow stopping the
boot process at a certain point and spawn a debug shell. After exiting this
shell, the system will resume booting.
It accepts the following values:
- `pre-udev`: before starting to process kernel uevents (initrd and host).
- `pre-basic`: before leaving early boot and regular services start (initrd and
host).
- `pre-mount`: before the root filesystem is mounted (initrd).
- `pre-switch-root`: before switching root (initrd).
In the rootfs these need to run after /var/lib/ has been set up. In the
initrd we want them to run as soon as possible so that they can be used
to customize setting up the rootfs.
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.
This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
This commit add the `-i` option to `udevadm trigger` that force it to
match parent devices even if they're excluded from filters.
The rationale is that some embedded devices have a huge number of
platform devices ( ~ 4k for MX8 ) they are there because they're defined
in the device tree but there isn't any action or udev rules associated
with them.
So at boot a significant time is spend triggering and processing rules
for devices that don't produce any effect and we would like to filter
them by calling:
```
udevadm trigger --type=device --action=add -s block -s tty
```
instead of the normal
```
udevadm trigger --type=device --action=add
```
so we can use filter to filter out only subsystems for we we know that
we have rules in place that do something useful.
On the other side action / rules are not triggered until the parent is
triggered ( which is part of another subsystem), so the additional option
will allows udev to complete the coldplug with only the devices we care.
Example on iMX8:
.Without the new option
```
root@dev:~# udevadm trigger --dry-run -s block --action=add -v
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot0
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot1
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p2
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p3
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p4
```
.With the new option
```
root@dev:~# udevadm trigger --dry-run -i -s block --action=add -v
/sys/devices/platform
/sys/devices/platform/bus@5b000000
/sys/devices/platform/bus@5b000000/5b010000.mmc
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot0
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot1
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p2
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p3
/sys/devices/platform/bus@5b000000/5b010000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p4
```
Boot time reduction with this is place is ~ 1 second.
Right now, the sending of wall messages on reboot/shutdown/etc can be
controlled via DBus properties. This patch adds support for changing the
default via the logind.conf file as well.
Note that the DBus setting is lost if logind is restarted or reloaded,
but it was already the case before this patch that the setting is lost
upon restart.
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.
However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.
To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.
Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```
as the init user namedspace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine if its a non-init user namespace based
on uid_map/gid_map files.
Note systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and uses
namespace inode. But some users may be running a container image with
older systemd < 258 so we keep this hack until version 259 for version
N-1 compatibility.
In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.
Fixes: #35168Fixes: #35425
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.
The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.
However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.
Fixes: #35425
This way, users don't have to check those features using an external
program, or wait for later failure when trying to enroll using an
unsupported feature.
E.g.:
```
# systemd-cryptenroll --fido2-device list
PATH MANUFACTURER PRODUCT RK CLIENTPIN UP UV
/dev/hidraw2 Yubico YubiKey OTP+FIDO+CCID yes no yes no
```
This introduces a new unit condition check: that matches if a specific
kmod module is allowed. This should be generally useful, but there's one
usecase in particular: we can optimize modprobe@.service with this and
avoid forking out a bunch of modprobe requests during boot for the same
kmods.
Checking if a kernel module is loaded is more complicated than just
checking if /sys/module/$MODULE/ exists, since kernel modules typically
take a while to initialize and we must check that this is complete (by
checking if the sysfs attr "initstate" is "live").
Document the fact that read-only properties may not have the flag
SD_BUS_VTABLE_UNPRIVILEGED as that is not obvious especially given the
flag is accepted for writable properties.
Based on the check in `add_object_vtable_internal` called by
`sd_bus_add_object_vtable` (as of the current tip of the main branch
f7f5ba0192):
case _SD_BUS_VTABLE_PROPERTY: {
[...]
if ([...] ||
[...]
(v->flags & SD_BUS_VTABLE_UNPRIVILEGED && v->type == _SD_BUS_VTABLE_PROPERTY)) {
r = -EINVAL;
goto fail;
}
(where `_SD_BUS_VTABLE_PROPERTY` means read-only property whereas
`_SD_BUS_VTABLE_WRITABLE_PROPERTY` maps to writable property).
This was implemented in the commit
adacb9575a ("bus: introduce "trusted" bus
concept and encode access control in object vtables") where
`SD_BUS_VTABLE_UNPRIVILEGED` was introduced:
Writable properties are also subject to SD_BUS_VTABLE_UNPRIVILEGED
and SD_BUS_VTABLE_CAPABILITY() for controlling write access to them.
Note however that read access is unrestricted, as PropertiesChanged
messages might send out the values anyway as an unrestricted
broadcast.
This PR allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host. This is useful for OS-like
containers running as units where they should have freedom to change
their container hostname if they want, but not the host's hostname.
Fixes: #30348
RestrictNamespaces= would accept "time" but would not actually apply
seccomp filters e.g. systemd-run -p RestrictNamespaces=time unshare -T true
should fail but it succeeded.
This commit actually enables time namespace seccomp filtering.
We'd silently skip devices which don't have the feature in the list.
This looked wrong esp. if no devices were suitable. Instead, list them
and show which ones are usable.
$ build/systemd-cryptenroll --fido2-device=list
PATH MANUFACTURER PRODUCT HMAC SECRET
/dev/hidraw7 Yubico YubiKey OTP+FIDO+CCID ✓
/dev/hidraw10 Yubico Security Key by Yubico ✗
/dev/hidraw5 Yubico Security Key by Yubico ✗
/dev/hidraw9 Yubico Yubikey 4 OTP+U2F+CCID ✗
This allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host.
Fixes: #30348
Migrating ProtectHostname to enum will set the stage for adding more
properties like ProtectHostname=private in future commits.
In addition, we add PrivateHostnameEx property to dbus API which uses
string instead of boolean.
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.
However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases, users
want a full identity mapping from 0 -> UID_MAX.
Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```
as the init user namedspace uses `0 0 UINT32_MAX` and some applications -
like systemd itself - determine if its a non-init user namespace based on
uid_map/gid_map files. Note systemd will remove this heuristic in
running_in_userns() in version 258 and uses namespace inode. But some users
may be running a container image with older systemd < 258 so we keep this
hack until version 259.
To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.
Fixes: #35168
In the initrd we want to run as early as possible, before
any of the filesystems are set up, so that users can use
sysexts to customize kernel modules, firmware, etc. But
in the root fs it needs to run after /var/ has been set
up. Split the unit, and have an initrd-specific one that
runs very early.
In the initrd we want to run as early as possible, before
any of the filesystems are set up, so that users can use
confexts to customize fstab/veritytab/crypttab/etc. But
in the root fs it needs to run after /var/ has been set
up. Split the unit, and have an initrd-specific one that
runs very early.