Bugfixes:

* Many manager configuration settings that are only applicable to user manager or system manager can always be set. It would be better to reject them when parsing config.

* Jun 01 09:43:02 krowka systemd[1]: Unit user@1000.service has alias user@.service.
  Jun 01 09:43:02 krowka systemd[1]: Unit user@6.service has alias user@.service.
  Jun 01 09:43:02 krowka systemd[1]: Unit user-runtime-dir@6.service has alias user-runtime-dir@.service.

External:

* Fedora: add an rpmlint check that verifies that all unit files in the RPM are listed in %systemd_post macros.

* dbus:
  - natively watch for dbus-*.service symlinks (PENDING)
  - teach dbus to activate all services it finds in /etc/systemd/services/org-*.service

* fedora: suggest auto-restart on failure, but not on success and not on coredump. also, ask people to think about changing the start limit logic. Also point people to RestartPreventExitStatus=, SuccessExitStatus=

* neither pkexec nor sudo initialize environ[] from the PAM environment?

* fedora: update policy to declare access mode and ownership of unit files to root:root 0644, and add an rpmlint check for it

* register catalog database signature as file magic

* zsh shell completion:
  - - should complete options, but currently does not
  - systemctl add-wants,add-requires
  - systemctl reboot --boot-loader-entry=

* systemctl status should know about 'systemd-analyze calendar ... --iterations='

* If timer has just OnInactiveSec=..., it should fire after a specified time after being started.
* write blog stories about:
  - hwdb: what belongs into it, lsusb
  - enabling dbus services
  - how to make changes to sysctl and sysfs attributes
  - remote access
  - how to pass throw-away units to systemd, or dynamically change properties of existing units
  - auto-restart
  - how to develop against journal browsing APIs
  - the journal HTTP iface
  - non-cgroup resource management
  - dynamic resource management with cgroups
  - refreshed, longer mission statement
  - calendar time events
  - init=/bin/sh vs. "emergency" mode, vs. "rescue" mode, vs. "multi-user" mode, vs. "graphical" mode, and the debug shell
  - how to create your own target
  - instantiated apache, dovecot and so on
  - hooking a script into various stages of shutdown/early boot

Regularly:

* look for close() vs. close_nointr() vs. close_nointr_nofail()

* check for strerror(r) instead of strerror(-r)

* pahole

* set_put(), hashmap_put() return values check. i.e. == 0 does not free()!

* use secure_getenv() instead of getenv() where appropriate

* link up selected blog stories from man pages and unit files Documentation= fields

Janitorial Clean-ups:

* rework mount.c and swap.c to follow proper state enumeration/deserialization semantics, like we do for device.c now

* get rid of prefix_roota() and similar, only use chase() and related calls instead.

* get rid of basename() and replace by path_extract_filename()

* Replace our fstype_is_network() with a call to libmount's mnt_fstype_is_netfs()? Having two lists is not nice, but maybe it's now worth making a dependency on libmount for something so trivial.

* drop set_free_free() and switch things over from string_hash_ops to string_hash_ops_free everywhere, so that destruction is implicit rather than explicit. Similarly, for other special hashmap/set/ordered_hashmap destructors.

* generators sometimes apply C escaping and sometimes specifier escaping to paths and similar strings they write out. Sometimes both. We should clean this up, and should probably always apply both, i.e.
  introduce unit_file_escape() or so, which applies both.

* xopenat() should pin the parent dir of the inode it creates before doing its thing, so that it can create, open, label somewhat atomically.

Deprecations and removals:

* Remove any support for booting without /usr pre-mounted in the initrd entirely. Update INITRD_INTERFACE.md accordingly.

* remove cgroups v1 support EOY 2023. As per https://lists.freedesktop.org/archives/systemd-devel/2022-July/048120.html and then rework cgroupsv2 support around fds, i.e. keep one fd per active unit around, and always operate on that, instead of cgroup fs paths.

* drop support for getrandom()-less kernels. (GRND_INSECURE means once kernel 5.6 becomes our baseline). See https://github.com/systemd/systemd/pull/24101#issuecomment-1193966468 for details. Maybe before that: add taint-flags/warn about kernels that lack getrandom()/environments where it is blocked.

* drop support for LOOP_CONFIGURE-less loopback block devices, once kernel baseline is 5.8.

* drop fd_is_mount_point() fallback mess once we can rely on STATX_ATTR_MOUNT_ROOT to exist, i.e. kernel baseline 5.8

* Remove /dev/mem ACPI FPDT parsing when /sys/firmware/acpi/fpdt is ubiquitous. That requires distros to enable CONFIG_ACPI_FPDT, and have kernels v5.12 for x86 and v6.2 for arm.

* Once baseline is 4.13, remove support for INTERFACE_OLD= checks in "udevadm trigger"'s waiting logic, since we can then rely on uuid-tagged uevents

Features:

* resolved: make resolved process DNR DHCP info

* Teach systemd-ssh-generator to generate a /run/issue.d/ drop-in telling users how to connect to the system via AF_VSOCK, as per: https://github.com/systemd/systemd/issues/35071#issuecomment-2462803142

* maybe introduce an OSC sequence that signals when we ask for a password, so that terminal emulators can maybe connect a password manager or so, and highlight things specially.
* Port pidref_namespace_open() to use PIDFD_GET_MNT_NAMESPACE and related ioctls to get nsfds directly from pidfds.

* start using STATX_SUBVOL in btrfs_is_subvol(). Also, make use of it generically, so that image discovery recognizes bcachefs subvols too.

* format-table: introduce new cell type for strings with ansi sequences in them. display them in regular output mode (via strip_tab_ansi()), but suppress them in json mode.

* machined: when registering a machine, also take a relative cgroup path, relative to the machine's unit. This is useful when registering unpriv machines, as they might sit down the cgroup tree, below a cgroup delegation boundary. Then, install an inotify watch on that cgroup to track when the machine's local cgroup goes down.

* resolved: report ttl in resolution replies if we know it. This data is useful for tools such as wireguard which want to periodically re-resolve DNS names, and might want to use the TTL as a hint for that.

* journald: beef up ClientContext logic to store pidfd_id of peer, to validate we really use the right cache entry

* journald: log client's pidfd id as a new automatic field _PIDFDID= or so.

* journald: split up ClientContext cache in two: one cache keyed by pid/pidfdid with process information, and another one keyed by cgroup path/cgroupid with cgroup information. This way a service consisting of many logging processes can benefit from the cgroup caching.

* system lsmbpf policy that prohibits creating files owned by "nobody" system-wide

* system lsmbpf policy that prohibits creating or opening device nodes outside of devtmpfs/tmpfs, except if they are the pseudo-devices /dev/null, /dev/zero, /dev/urandom and so on.
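The ClientContext cache split proposed above could look roughly like the following. This is only an illustrative sketch in Python, not journald's C code: the names ProcessInfo, CGroupInfo, SplitClientCache and the two loader callbacks are hypothetical, and a real implementation would also need eviction and invalidation. It shows one cache keyed by (pid, pidfd id) and a second keyed by cgroup id, so that many processes of one service share a single cgroup entry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ProcessInfo:
    comm: str
    cgroup_id: int  # links the per-process entry to the per-cgroup entry


@dataclass(frozen=True)
class CGroupInfo:
    unit: str
    slice: str


class SplitClientCache:
    """Hypothetical two-level cache: process data and cgroup data are cached separately."""

    def __init__(self,
                 load_process: Callable[[int], ProcessInfo],
                 load_cgroup: Callable[[int], CGroupInfo]) -> None:
        self._load_process = load_process
        self._load_cgroup = load_cgroup
        self._processes: Dict[Tuple[int, int], ProcessInfo] = {}
        self._cgroups: Dict[int, CGroupInfo] = {}
        self.cgroup_loads = 0  # counts expensive cgroup lookups, for illustration

    def lookup(self, pid: int, pidfd_id: int) -> Tuple[ProcessInfo, CGroupInfo]:
        # Keying by (pid, pidfd id) avoids confusing a recycled PID with the old process.
        key = (pid, pidfd_id)
        proc = self._processes.get(key)
        if proc is None:
            proc = self._load_process(pid)
            self._processes[key] = proc
        # The cgroup entry is shared by every process in the same cgroup.
        cgroup = self._cgroups.get(proc.cgroup_id)
        if cgroup is None:
            cgroup = self._load_cgroup(proc.cgroup_id)
            self._cgroups[proc.cgroup_id] = cgroup
            self.cgroup_loads += 1
        return proc, cgroup


def demo_cache() -> SplitClientCache:
    # Three logging processes of one made-up service, all in cgroup id 42.
    cache = SplitClientCache(
        load_process=lambda pid: ProcessInfo(comm=f"worker-{pid}", cgroup_id=42),
        load_cgroup=lambda cgid: CGroupInfo(unit="example.service", slice="system.slice"),
    )
    for pid in (100, 101, 102):
        cache.lookup(pid, pidfd_id=pid + 1000)
    return cache
```

In this sketch the three processes cause three process lookups but only one cgroup lookup, which is the benefit the item describes.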
* system lsmbpf policy that enforces that block device backed mounts may only be established on top of dm-crypt or dm-verity devices, or an allowlist of file systems (which should probably include vfat, for compat with the ESP)

* $LISTEN_PID, $MAINPID and $SYSTEMD_EXECPID env vars that the service manager sets should be augmented with $LISTEN_PIDFDID, $MAINPIDFDID and $SYSTEMD_EXECPIDFD (and similar for other env vars we might send).

* port copy.c over to use LabelOps for all labelling.

* port remaining getmntent() users over to libmount. There are subtle differences in the parsers (see #25371 for example), and it hence makes sense if we stick to one set of parsers on this, not mix both.

* run0 and run0 --user=root have different effect on tty ownership?

* get rid of compat with libidn.so.11 (retain only for libidn.so.12)

* get rid of compat with libbpf.so.0 (retain only for libbpf.so.1)

* define a generic "report" varlink interface, which services can implement to provide health/statistics data about themselves. then define a dir somewhere in /run/ where components can bind such sockets. Then make journald, logind, and pid1 itself implement this and expose various stats on things there. Then issue parallel calls to these interfaces from the systemd-report tool, combine into one json document, and include measurement logs and tpm quote. tpm quote should protect the json doc via the nonce field stuff. Allow shipping this off elsewhere for analysis.

* The bind(AF_UNSPEC) construct (for resetting sockets to their initial state) should be blocked in many cases because it punches holes in many sandboxes.

* find a nice way to opt-in into auto-masking SIGCHLD on first sd_event_add_child(), and then get rid of many more explicit sigprocmask() calls.

* introduce new structure Tpm2CombinedPolicy, that combines the various TPM2 policy bits into one structure, i.e. public key info, pcr masks, pcrlock stuff, pin and so on.
  Then pass that around in tpm2_seal() and tpm2_unseal().

* look at nsresourced, mountfsd, homed, importd, and try to come up with a way how the forked off worker processes can be moved into transient services with sandboxing, without breaking notify socket stuff and so on.

* replace all \x1b, \x1B, \033 C string escape sequences in our codebase with a more readable \e. It's a GNU extension, but a ton more readable than the others, and most importantly it doesn't result in confusing errors if you suffix the escape sequence with one more decimal digit, because compilers think you might actually specify a value outside the 8bit range with that.

* homed: allow login via username + realm on getty/login prompt. Then rewrite the user name in the PAM stack

* homed/userdb: add "aliases" field to user record, which can alternatively be used for logging in. Rewrite user name in the PAM stack once acquired.

* confext/sysext: instead of mounting the overlayfs directly on /etc/ + /usr/, insert an intermediary bind mount on itself there. This has the benefit that services where mount propagation from the root fs is off can still have confext/sysext propagated in.

* generic interface for varlink for setting log level and stuff that all our daemons can implement

* maybe teach repart.d/ dropins a new setting MakeMountNodes= or so, which is just like MakeDirectories=, but uses an access mode of 0000 and sets the +i chattr bit. This is useful as protection against early uses of /var/ or /tmp/ before their contents are mounted.

* go through all uses of table_new() in our codebase, and make sure we support all three of:
  1. --no-legend properly
  2. --json= properly
  3. --no-pager properly

* go through all --help texts in our codebases, and make sure:
  1. the one sentence description of the tool is highlighted via ANSI how we usually do it
  2. If more than one or two commands are supported (as opposed to switches), separate commands + switches from each other, using underlined --help sections.
  3. If there are many switches, consider adding additional --help sections.

* go through our codebase, and convert "vertical tables" (i.e. things such as "systemctl status") to use table_new_vertical() for output

* pcrlock: add support for multi-profile UKIs

* logind: when logging in use new tmpfs quota support to configure quota on /tmp/ + /dev/shm/. But do so only in case of tmpfs, because otherwise quota is persistent and any persistent settings mean we don't have to reapply them.

* initrd: when transitioning from initrd to host, validate that /lib/modules/`uname -r` exists, refuse otherwise

* signed bpf loading: to address need for signature verification for bpf programs when they are loaded, and given the bpf folks don't think this is realistic in kernel space, maybe add small daemon that facilitates this loading on request of clients, validates signatures and then loads the programs. This daemon should be the only daemon with privs to load BPF on the system. It might be a good idea to run this daemon already in the initrd, and leave it around during the initrd transition, to continue serving requests. Should then live in its own fs namespace that inherits from the initrd's fs tree, not from the host, to isolate it properly. Should set PR_SET_DUMPABLE so that it cannot be ptraced from the host. Should have CAP_SYS_BPF as only service around.

* add a mechanism by which we can drop capabilities from pid1 *before* transitioning from initrd to host. i.e. before we transition into the slightly lower trust domain that is the host system, we might want to get rid of some caps. Example: CAP_SYS_BPF in the signed bpf loading logic above. (We already have CapabilityBoundingSet= in system.conf, but that is enforced when pid 1 initializes, rather than when it transitions to the next.)

* maybe add a new standard slice where processes that are started in the initrd and stick around for the whole system runtime (i.e.
  root fs storage daemons, the bpf loader daemon discussed above, and such) are placed. maybe protected.slice or so? Then write docs that suggest that services like this set Slice=protected.slice, RefuseManualStart=yes, RefuseManualStop=yes and a couple of other things.

* add feature to xopenat() that implements O_REGULAR in userspace: i.e. let's open the inode via O_PATH first, then validate its type, and then convert to proper fd via fd_reopen()

* rough proposed implementation design for remote attestation infra: add a tool that generates a quote of local PCRs and NvPCRs, along with synchronous log snapshot. use "audit session" logic for that, so that we get read-outs and signature in one step. Then turn this into a JSON object. Use the "TCG TSS 2.0 JSON Data Types and Policy Language" format to encode the signature. And CEL for the measurement log.

* creds: add a new cred format that reuses the JSON structures we use in the LUKS header, so that we get the various newer policies for free.

* drop PCR 7 from default PCR mask in credentials and LUKS2 enrollments

* systemd-analyze: port "pcrs" verb to talk directly to TPM device, instead of using sysfs interface (well, or maybe not, as that would require privileges?)

* pcrextend/tpm2-util: add a concept of "rotation" to event log. i.e. allow trailing parts of the logs if time or disk space limit is hit. Protect the boot-time measurements however (i.e. up to some point where things are settled), since we need those for pcrlock measurements and similar. When deleting entries for rotation, place an event that declares how many items have been dropped, and what the hash was before and after that.

* measure information about all DDIs as we activate them to an NvPCR. We probably should measure the dm-verity root hash from the kernel side, but DDI meta info from userspace.

* rework tpm2_parse_pcr_argument_to_mask() to refuse literal hash value specifications. They are currently parsed but ignored.
  We should refuse them however, to not confuse people.

* use name_to_handle_at() with AT_HANDLE_FID instead of .st_ino (inode number) for identifying inodes, for example in copy.c when finding hard links, or loop-util.c for tracking backing files, and other places.

* cryptenroll/cryptsetup/homed: add unlock mechanism that combines tpm2 and fido2, as well as tpm2 + ssh-agent, inspired by ChromeOS' logic: encrypt the volume key with the TPM, with a policy that insists that a nonce is signed by the fido2 device's key or ssh-agent key. Thus, at unlock/login time the TPM generates a nonce, which is sent as a challenge to the fido2/ssh-agent, which returns a signature which is handed to the tpm, which then reveals the volume key to the PC.

* cryptenroll/cryptsetup/homed: similar to this, implement TOTP backed by TPM.

* expose the handoff timestamp fully via the D-Bus properties that contain ExecStatus information

* properly serialize the ExecStatus data from all ExecCommand objects associated with services, sockets, mounts and swaps. Currently, the data is flushed out on reload, which is quite a limitation.

* Clean up "reboot argument" handling, i.e. set it through some IPC service instead of directly via /run/, so that it can be sensibly set remotely.

* userdb: add concept for user "aliases", to cover for cases where you can log in under the name lennart@somenetworkfsserver, and it would automatically generate a local user, and from then on both names can be used to allow logins into the same account.

* systemd-tpm2-support: add some logic that detects if system is in DA lockout mode, and queries the user for TPM recovery PIN then.

* systemd-repart should probably enable btrfs' "temp_fsid" feature for all file systems it creates, as we have no interest in RAID for repart, and it should make sure that we can mount them trivially everywhere.

* systemd-nspawn should get the same SSH key support that vmspawn now has.
* move documentation about our common env vars (SYSTEMD_LOG_LEVEL, SYSTEMD_PAGER, …) into a man page of its own, and just link it from our various man pages that so far embed the whole list again and again, in an attempt to reduce clutter and noise a bit.

* vmspawn: switch default swtpm PCR bank to SHA384-only (away from SHA256), at least on 64bit archs, simply because SHA384 is typically double the hashing speed of SHA256 on 64bit archs (since based on 64bit words unlike SHA256 which uses 32bit words).

* In vmspawn/nspawn/machined wait for X_SYSTEMD_UNIT_ACTIVE=ssh-active.target and X_SYSTEMD_SIGNALS_LEVEL=2 as indication whether/when SSH and the POSIX signals are available. Similar for D-Bus (but just use sockets.target for that). Report as property for the machine.

* teach nspawn/machined a new bus call/verb that gets you a shell in containers that have no sensible pid1, via joining the container, and invoking a shell directly. Then provide another new bus call/verb that is somewhat automatic: if we detect that pid1 is running and fully booted up we provide a proper login shell, otherwise just a joined shell. Then expose that as primary way into the container.

* make vmspawn/nspawn/importd/machined a bit more usable in a WSL-like fashion. i.e. teach unpriv systemd-vmspawn/systemd-nspawn a reasonable --bind-user= behaviour that mounts the calling user through into the machine. Then, ship importd with a small database of well known distro images along with their pinned signature keys. Then add some minimal glue that binds this together: downloads a suitable image if not done so yet, starts it in the bg via vmspawn/nspawn if not done so yet and then requests a shell inside it for the invoking user.

* importd/…: define per-user dirs for container/VM images too.

* add a new specifier to unit files that figures out the DDI the unit file is from, tracing through overlayfs, DM, loopback block device.
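For the item above about waiting for X_SYSTEMD_UNIT_ACTIVE=ssh-active.target and X_SYSTEMD_SIGNALS_LEVEL=2, the tracking logic on the receiving side might be sketched as follows. This is a Python illustration rather than the C code vmspawn/nspawn/machined would use; the MachineReadiness class and its attribute names are hypothetical. It only assumes that notification datagrams carry newline-separated KEY=VALUE assignments, as sd_notify() messages do.

```python
from typing import Optional, Set


class MachineReadiness:
    """Hypothetical tracker fed with notification datagrams received from a machine's PID 1."""

    def __init__(self) -> None:
        self.active_units: Set[str] = set()
        self.signals_level: Optional[str] = None

    def feed(self, datagram: bytes) -> None:
        # Each datagram is a list of newline-separated KEY=VALUE assignments.
        for line in datagram.decode("utf-8", errors="replace").splitlines():
            key, sep, value = line.partition("=")
            if not sep:
                continue  # not an assignment, ignore
            if key == "X_SYSTEMD_UNIT_ACTIVE":
                self.active_units.add(value)
            elif key == "X_SYSTEMD_SIGNALS_LEVEL":
                self.signals_level = value

    @property
    def ssh_available(self) -> bool:
        return "ssh-active.target" in self.active_units

    @property
    def signals_available(self) -> bool:
        return self.signals_level == "2"
```

The two boolean properties are what could then be reported as a property of the machine, as the item suggests.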
* importd/importctl
  - port tar handling to libarchive
  - complete varlink interface
  - download images into .v/ dirs

* in os-release define a field that can be initialized at build time from SOURCE_DATE_EPOCH (maybe even under that name?). Would then be used to initialize the timestamp logic of ConditionNeedsUpdate=.

* nspawn/vmspawn/pid1: add ability to easily insert fully booted VMs/FOSC into shell pipelines, i.e. add easy to use switch that turns off console status output, and generates the right credentials for systemd-run-generator so that a program is invoked, and its output captured, with correct EOF handling and exit code propagation

* new systemd-analyze "join" verb or so, for debugging services. Would be nsenter on steroids, i.e. invoke a shell or command line in an environment as close as we can make it for the MainPID of a service. Should be built around pidfd, so that we can reasonably robustly do this. Would only cover the execution environment like namespaces, but not the privilege settings.

* Introduce a CGroupRef structure, inspired by PidRef. Should contain cgroup path, cgroup id, and cgroup fd. Use it to continuously pin all v2 cgroups via a cgroup_ref field in the CGroupRuntime structure. Eventually switch things over to do all cgroupfs access only via that structure's fd.

* Get rid of the symlinks in /run/systemd/units/* and exclusively use cgroupfs xattrs to convey info about invocation ids, logging settings and so on. support for cgroupfs xattrs in the "trusted." namespace was added in linux 3.7, i.e. which we don't pretend to support anymore.

* rewrite bpf-devices in libbpf/C code, rather than home-grown BPF assembly, to match bpf-restrict-fs, bpf-restrict-ifaces, bpf-socket-bind

* ditto: rewrite bpf-firewall in libbpf/C code

* credentials: if we ever acquire a secure way to derive cgroup id of socket peers (i.e. SO_PEERCGROUPID), then extend the "scoped" credential logic to allow cgroup-scoped (i.e. app or service scoped) credentials.
  Then, as next step use this to implement per-app/per-service encrypted directories, where we set up fscrypt on the StateDirectory= with a randomized key which is stored as xattr on the directory, encrypted as a credential.

* credentials: optionally include a per-user secret in scoped user-credential encryption keys. should come from homed in some way, derived from the luks volume key or fscrypt directory key.

* credentials: add a flag to the scoped credentials that if set requires PK reauthentication when unlocking a secret.

* teach systemd --user to properly load credentials off disk, with /etc/credstore equivalent and similar. Make sure that $CREDENTIALS_DIRECTORY= actually works too when run with user privs.

* extend the smbios11 logic for passing credentials so that instead of passing the credential data literally it can also just reference an AF_VSOCK CID/port to read them from. This way the data doesn't remain in the SMBIOS blob during runtime, but only in the credentials fs.

* machined: optionally track nspawn unix-export/ runtime for each machine, and then update systemd-ssh-proxy so that it can connect to that.

* add a new ExecStart= flag that inserts the configured user's shell as first word in the command line. (maybe use character '.'). Usecase: tools such as run0 can use that to spawn the target user's default shell.

* introduce mntid_t, and make it 64bit, as apparently the kernel switched to 64bit mount ids

* mountfsd/nsresourced
  - userdb: maybe allow callers to map one uid to their own uid
  - bpflsm: allow writes if resulting UID on disk would be userns' owner UID
  - make encrypted DDIs work (password…)
  - add API for creating a new file system from scratch (together with some dm-integrity/HMAC key). Should probably work using systemd-repart (access via varlink).
  - add api to make an existing file "trusted" via dm-integrity/HMAC key
  - port: portabled
  - port: tmpfiles, sysusers and similar
  - let's see if we can make runtime bind mounts into unpriv nspawn work

* add a kernel cmdline switch (and cred?) for marking a system to be "headless", in which case we never open /dev/console for reading, only for writing. This would then mean: systemd-firstboot would process creds but not ask interactively, getty would not be started and so on.

* cryptsetup: new crypttab option to auto-grow a luks device to its backing partition size. new crypttab option to reencrypt a luks device with a new volume key.

* we probably should have some infrastructure to acquire sysexts with drivers/firmware for local hardware automatically. Idea: reuse the modalias logic of the kernel for this: make the main OS image install a hwdb file that matches against local modalias strings, and adds properties to relevant devices listing names of sysexts needed to support the hw. Then provide some tool that goes through all devices and tries to acquire/download the specified images.

* repart + cryptsetup: support file systems that are encrypted and use verity on top. Usecase: confexts that shall be signed by the admin but also be confidential. Then, add a new --make-ddi=confext-encrypted for this.

* tmpfiles: add new line type for moving files from some source dir to some target dir. then use that to move sysexts/confexts and stuff from initrd tmpfs to /run/, so that host can pick things up.

* tiny varlink service that takes a fd passed in and serves it via http. Then make use of that in networkd, and expose some EFI binary of choice for DHCP/HTTP base EFI boot.

* bootctl: add reboot-to-disk which takes a block device name, and automatically sets things up so that system reboots into that device next.

* maybe: in PID1, when we detect we run in an initrd, make superblock read-only early on, but provide opt-out via kernel cmdline.
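The modalias-based sysext acquisition idea above could be prototyped along these lines. This is a rough Python sketch under stated assumptions: the rule table stands in for a hwdb file, and the pattern strings, the sysext names and the function name sysexts_for_devices() are all made up for illustration. Only the general approach (glob patterns matched against modalias strings, yielding a list of sysexts to acquire) comes from the item.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence, Tuple

# Hypothetical stand-in for hwdb entries: (modalias glob, sysexts needed for that hardware).
EXAMPLE_RULES: List[Tuple[str, List[str]]] = [
    ("pci:v000010DEd*", ["gpu-vendor-a"]),
    ("usb:v1234p5678*", ["fw-widget", "widget-tools"]),
]


def sysexts_for_devices(modaliases: Iterable[str],
                        rules: Sequence[Tuple[str, List[str]]]) -> List[str]:
    """Return the sysext names required by the given devices, without duplicates."""
    needed: List[str] = []
    for modalias in modaliases:
        for pattern, sysexts in rules:
            # Glob matching, case-sensitive, against the full modalias string.
            if fnmatchcase(modalias, pattern):
                for name in sysexts:
                    if name not in needed:
                        needed.append(name)
    return needed
```

A tool built on this would enumerate the modalias strings of all local devices, call the function, and then try to download whatever images are listed but not yet installed.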
* systemd-pcrextend:
  - support measuring to nvindex with PCR update semantics ("fake PCRs")
  - add api for "allocating" such an nvindex
  - once we have that start measuring every sysext we apply, every confext, every RootImage= we apply, every nspawn and so on. All in separate fake PCRs.

* vmspawn:
  - run in scope unit when invoked from command line, and machined registration is off
  - sd_notify support
  - --ephemeral support
  - --read-only support
  - automatically suspend/resume the VM if the host suspends. Use logind suspend inhibitor to implement this. request clean suspend by generating suspend key presses.
  - support for "real" networking via "-n" and --network-bridge=
  - translate SIGTERM to clean ACPI shutdown event

* systemd-pcrmachine should probably also measure the SMBIOS system UUID.

* sd-boot: allow synthesizing additional type1 entries via SMBIOS vendor strings

* storagetm:
  - add USB mass storage device logic, so that all local disks are also exposed as mass storage devices on systems that have a USB controller that can operate in device mode
  - add NVMe authentication

* add support for activating nvme-oF devices at boot automatically via kernel cmdline, and maybe even support a syntax such as root=nvme::::: to boot directly from nvme-oF

* pcrlock:
  - add kernel-install plugin that automatically creates UKI .pcrlock file when UKI is installed, and removes it when it is removed again
  - automatically install PE measurement of sd-boot on "bootctl install"
  - pre-calc sysext + kernel cmdline measurements
  - pre-calc cryptsetup root key measurement
  - maybe make systemd-repart generate .pcrlock for old and new GPT header in /run?
  - Add support for more than 8 branches per PCR OR
  - add "systemd-pcrlock lock-kernel-current" or so which synthesizes .pcrlock policy from currently booted kernel/event log, to close gap for first boot for pre-built images

* in sd-boot and sd-stub measure the SMBIOS vendor strings to some PCR (at least some subset of them that look like systemd stuff), because apparently some firmware does not, but systemd honours it. avoid duplicate measurement by sd-boot and sd-stub by adding LoaderFeatures/StubFeatures flag for this, so that sd-stub can avoid it if sd-boot already did it.

* cryptsetup: a mechanism that allows signing a volume key with some key that has to be present in the kernel keyring, or similar, to ensure that confext DDIs can be encrypted against the local SRK but signed with the admin's key and thus can be authenticated locally before they are decrypted.

* image policy should be extended to allow dictating *how* a disk is unlocked, i.e. root=encrypted-tpm2+encrypted-fido2 would mean "root fs must be encrypted and unlocked via fido2 or tpm2, but not otherwise"

* systemd-repart: add support for formatting dm-crypt + dm-integrity file systems.

* homed: use systemd-storagetm to expose home dirs via nvme-tcp. Then, teach homed/pam_systemd_homed with a user name such as lennart%nvme_tcp_192.168.100.77_8787 to log in from any linux host with the same home dir. Similar maybe for nbd, iscsi? this should then first ask for the local root pw, to authenticate that logging in like this is ok, and would then be followed by another password prompt asking for the user's own password. Also, do something similar for CIFS: if you log in via lennart%cifs-someserver_someshare, then set up the homed dir for it automatically. The PAM module should update the user name used for login to the short version once it set up the user.
  Some care should be taken, so that the long version can still be resolved via NSS afterwards, to deal with PAM clients that do not support PAM sessions where PAM_USER changes half-way.

* redefine /var/lib/extensions/ as the dir one can place all three of sysext, confext as well as multi-modal DDIs that qualify as both. Then introduce /var/lib/sysexts/ which can be used to place only DDIs that shall be used as sysext

* Varlinkification of the following command line tools, to open them up to other programs via IPC:
  - bootctl
  - journalctl (allowing journal read access via IPC)
  - coredumpctl
  - systemd-bless-boot
  - systemd-measure
  - systemd-cryptenroll (to allow UIs to enroll FIDO2 keys and such)
  - systemd-dissect
  - systemd-sysupdate
  - systemd-analyze
  - kernel-install
  - systemd-mount (with PK so that desktop environments could use it to mount disks)

* enumerate virtiofs devices during boot-up in a generator, and synthesize mounts for rootfs, /usr/, /home/, /srv/ and some others from it, depending on the "tag". (waits for: https://gitlab.com/virtio-fs/virtiofsd/-/issues/128)

* automatically mount one virtiofs during early boot phase to /run/host/, similar to how we do that for nspawn, based on some clear tag.

* add some service that makes an atomic snapshot of PCR state and event log up to that point available, possibly even with quote by the TPM.

* encode type1 entries in some UKI section to add additional entries to the menu.

* Add ACL-based access management to .socket units. i.e. add AllowPeerUser= + AllowPeerGroup= that installs additional user/group ACL entries on AF_UNIX sockets.

* systemd-tpm2-setup should probably have a factory reset logic, i.e. when some kernel command line option is set we reset the TPM (equivalent of tpm2_clear -c owner? or rather echo 5 >/sys/class/tpm/tpm0/ppi/request?).

* systemd-tpm2-setup should support a mode where we refuse booting if the SRK changed.
  (Must be opt-in, to not break systems which are supposed to be migratable between PCs)

* when systemd-sysext learns mutable /usr/ (and systemd-confext mutable /etc/) then allow them to store the result in a .v/ versioned subdir, for some basic snapshot logic

* add a new PE binary section ".mokkeys" or so which sd-stub will insert into Mok keyring, by overriding/extending whatever shim sets in the EFI var. Benefit: we can extend the kernel module keyring at ukify time, i.e. without recompiling the kernel, taking an upstream OS' kernel and adding a local key to it.

* PidRef conversion work:
  - cg_pid_get_xyz()
  - pid_from_same_root_fs()
  - get_ctty_devnr()
  - actually wait for POLLIN on pidref's pidfd in service logic
  - openpt_allocate_in_namespace()
  - unit_attach_pid_to_cgroup_via_bus()
  - cg_attach() – requires new kernel feature
  - journald's process cache

* ddi must be listed as block device fstype

* measure some string via pcrphase whenever we end up booting into emergency mode.

* similar, measure some string via pcrphase whenever we resume from hibernate

* homed: add a basic form of secrets management to homed, that stores secrets in $HOME somewhere, is protected by the account's own authentication mechanisms. Should implement something PKCS#11-like that can be used to implement emulated FIDO2 in unpriv userspace on top (which should happen outside of homed), emulated PKCS11, and libsecrets support. Operate with a 2nd key derived from volume key of the user, with which to wrap all keys. maintain keys in kernel keyring if possible.

* use sd-event ratelimit feature optionally for journal stream clients that log too much

* systemd-mount should only consider modern file systems when mounting, similar to systemd-dissect

* add another PE section ".fname" or so that encodes the intended filename for PE file, and validate that when loading add-ons and similar before using it.
  This is particularly relevant when we load multiple add-ons and want to sort them to apply them in a defined order. The order should not be under control of the attacker.

* also include packaging metadata (á la https://systemd.io/ELF_PACKAGE_METADATA/) in our UEFI PE binaries, using the same JSON format.

* make "bootctl install" + "bootctl update" useful for installing shim too. For that introduce new dir /usr/lib/systemd/efi/extra/ which we copy mostly 1:1 into the ESP at install time. Then make the logic smart enough so that we don't overwrite bootx64.efi with our own if the extra tree already contains one. Also, follow symlinks when copying, so that shim rpm can symlink their stuff into our dir (which is safe since the target ESP is generally VFAT and thus does not have symlinks anyway). Later, teach the update logic to look at the ELF package metadata (which we also should include in all PE files, see above) for version info in all *.EFI files, and use it to only update if newer.

* in sd-stub: optionally add support for a new PE section .keyring or so that contains additional certificates to include in the Mok keyring, extending what shim might have placed there. why? let's say I use "ukify" to build + sign my own fedora-based UKIs, and only enroll my personal lennart key via shim. Then, I want to include the fedora keyring in it, so that kmods work. But I might not want to enroll the fedora key in shim, because this would also mean that the key would be in effect whenever I boot an archlinux UKI built the same way, signed with the same lennart key.

* resolved: take possession of some IPv6 ULA address (let's say fd00:5353:5353:5353:5353:5353:5353:5353), and listen on port 53 on it for the local stubs, so that we can make the stub available via ipv6 too.

* Maybe add SwitchRootEx() as new bus call that takes env vars to set for new PID 1 as argument.
When adding SwitchRootEx() we should maybe also add a flags param that allows disabling and enabling whether serialization is requested during switch root. * introduce a .acpitable section for early ACPI table override * add proper .osrel matching for PE addons. i.e. refuse applying an addon intended for a different OS. Take inspiration from how confext/sysext are matched against OS. * figure out what to do about credentials sealed to PCRs in kexec + soft-reboot scenarios. Maybe insist sealing is done additionally against some keypair in the TPM to which access is updated on each boot, for the next, or so? * logind: when logging in, always take an fd to the home dir, to keep the dir busy, so that autofs release can never happen. (this is generally a good idea, and specifically works around the fact the autofs ignores busy by mount namespaces) * mount most file systems with a restrictive uidmap. e.g. mount /usr/ with a uidmap that blocks out anything outside 0…1000 (i.e. system users) and similar. * mount the root fs with MS_NOSUID by default, and then mount /usr/ without both so that suid executables can only be placed there. Do this already in the initrd. If /usr/ is not split out create a bind mount automatically. * fix our various hwdb lookup keys to end with ":" again. The original idea was that hwdb patterns can match arbitrary fields with expressions like "*:foobar:*", to wildcard match both the start and the end of the string. This only works safely for later extensions of the string if the strings always end in a colon. This requires updating our udev rules, as well as checking if the various hwdb files are fine with that. * mount /tmp/ and /var/tmp with a uidmap applied that blocks out "nobody" user among other things such as dynamic uid ranges for containers and so on. That way no one can create files there with these uids and we enforce they are only used transiently, never persistently. 
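The hwdb lookup key item above argues that patterns such as "*:foobar:*" only stay safe under later extensions if every key ends in a colon. A minimal sketch of that argument follows, using Python's fnmatch as a stand-in for hwdb's glob matching; the key strings and field names are made up for illustration and are not real hwdb keys.

```python
from fnmatch import fnmatchcase

# Hypothetical lookup keys; the field names are invented for illustration.
KEY_OPEN = "bus:vendor:foobar"            # last field, no trailing ":"
KEY_CLOSED = "bus:vendor:foobar:"         # same key, terminated with ":"
KEY_EXTENDED = "bus:vendor:foobar:rev2:"  # a later extension appends a field

FIELD_PATTERN = "*:foobar:*"  # wildcard on both sides of the field
TAIL_PATTERN = "*:foobar"     # what one is forced to write without the colon


def matches(key: str, pattern: str) -> bool:
    """Case-sensitive glob match of a lookup key against a pattern."""
    return fnmatchcase(key, pattern)
```

With colon-terminated keys, FIELD_PATTERN matches both KEY_CLOSED and KEY_EXTENDED. Without the terminator, only TAIL_PATTERN matches KEY_OPEN, and it stops matching as soon as a field is appended.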
* rework loopback support in fstab: when the "loop" option is used, then instantiate a new systemd-loop@.service for the source path, set the lo_file_name field for it to something recognizable derived from the fstab line, and then generate a mount unit for it using a udev generated symlink based on lo_file_name. * teach systemd-nspawn the boot assessment logic: hook up vpick's try counters with success notifications from nspawn payloads. When this is enabled, automatically support reverting back to older OS version images if newer ones fail to boot. * implement new "systemd-fsrebind" tool that works like gpt-auto-generator but looks at a root dir and then applies vpick on various dirs/images to pick a root tree, a /usr/ tree, a /home/, a /srv/, a /var/ tree and so on. Dirs could also be btrfs subvols (combine with btrfs auto-snapshot approach for creating versions like these automatically). * remove tomoyo support, it's obsolete and unmaintained apparently * In .socket units, add ConnectStream=, ConnectDatagram=, ConnectSequentialPacket= that create a socket, and then *connect to* rather than listen on some socket. Then, add a new setting WriteData= that takes some base64 data that systemd will write into the socket early on. This can then be used to create connections to arbitrary services and issue requests into them, as long as the data is static. This can then be combined with the aforementioned journald subscription varlink service, to enable activation-by-message id and similar. * .service with invalid Sockets= starts successfully. * landlock: lock down RuntimeDirectory= via landlock, so that services lose ability to write anywhere else below /run/. Similar for StateDirectory=. Benefit would be clear delegation via unit files: services get the directories they get, and nothing else even if they wanted to. * landlock: for unprivileged systemd (i.e. systemd --user), use landlock to implement ProtectSystem=, ProtectHome= and so on.
Landlock does not require privs, and we can implement pretty similar behaviour. Also, maybe add a mode where ProtectSystem= combined with an explicit PrivateMounts=no could request similar behaviour for system services, too. * Add systemd-mount@.service which is instantiated for a block device and invokes systemd-mount and exits. This is then useful to use in ENV{SYSTEMD_WANTS} in udev rules, and a bit prettier than using RUN+= * udevd: extend memory pressure logic: also kill any idle worker processes * udevadm: to make symlink querying with udevadm nicer: - do not enable the pager for queries like 'udevadm info -q symlink -r' - add mode with newlines instead of spaces (for grep)? * SIGRTMIN+18 and memory pressure handling should still be added to: hostnamed, localed, oomd, timedated. * repart/gpt-auto/DDIs: maybe introduce a concept of "extension" partitions, that have a new type uuid and can "extend" earlier partitions, to work around the fact that systemd-repart can only grow the last partition defined. During activation we'd simply set up a dm-linear mapping to merge them again. A partition that is to be extended would just set a bit in the partition flags field to indicate that there's another extension partition to look for. The identifying UUID of the extension partition would be hashed in counter mode from the uuid of the original partition it extends. Inspiration for this is the "dynamic partitions" concept of new Android. This would be a minimalistic concept of a volume manager, with the extents it manages being exposed as GPT partitions. If a partition is extended multiple times they should probably grow exponentially in size to ensure O(log(n)) time for finding them on access. * Make nspawn a frontend for systemd-executor, so that we have two ways into the executor: via unit files/dbus/varlink through PID1 and via cmdline/OCI through nspawn.
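The "extension" partition idea above (UUIDs hashed in counter mode from the original partition's UUID, sizes growing exponentially) can be sketched as follows. The item does not specify the hash construction; HMAC-SHA256 keyed by the base UUID over a little-endian counter is purely an assumption made here for illustration, as are all function names.

```python
import hashlib
import hmac
import uuid


def extension_uuid(base: uuid.UUID, counter: int) -> uuid.UUID:
    """Derive the UUID of the counter-th extension of partition `base`.

    Assumption: HMAC-SHA256 keyed by the base UUID over the counter; the
    TODO item only says "hashed in counter mode".
    """
    mac = hmac.new(base.bytes, counter.to_bytes(8, "little"), hashlib.sha256)
    return uuid.UUID(bytes=mac.digest()[:16], version=4)


def extension_sizes(first: int, count: int) -> list[int]:
    """Sizes of `count` extensions when each one doubles the previous."""
    return [first << i for i in range(count)]


def extensions_needed(first: int, target: int) -> int:
    """How many doubling extensions are needed to add `target` units."""
    total, n = 0, 0
    while total < target:
        total += first << n
        n += 1
    return n
```

Because every extension doubles the previous one, growing by a factor of roughly 1000 takes only about ten extensions, which is where the O(log(n)) lookup bound in the item comes from.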
* sd-stub: detect if we are running with uefi console output on serial, and if so automatically add console= to kernel cmdline matching the same port. * add a utility that can be used with the kernel's CONFIG_STATIC_USERMODEHELPER_PATH and then handles them within pid1 so that security, resource management and cgroup settings can be enforced properly for all umh processes. * homed: when resizing an fs don't sync identity beforehand, as there might simply not be enough disk space for that. try to be defensive and sync only after resize. * homed: if for some reason the partition ended up being much smaller than whole disk, recover from that, and grow it again. * timesyncd: when saving/restoring clock try to take boot time into account. Specifically, along with the saved clock, store the current boot ID. When starting, check if the boot id matches. If so, don't do anything (we are on the same boot and clock just kept running anyway). If not, then read CLOCK_BOOTTIME (which started at boot), and add it to the saved clock timestamp, to compensate for the time we spent booting. If EFI timestamps are available, also include that in the calculation. With this we'll then only miss the time spent during shutdown after timesync stopped and before the system actually reset. * systemd-stub: maybe store a "boot counter" in the ESP, and pass it down to userspace to allow ordering boots (for example in journalctl). The counter would be monotonically increased on every boot. * pam_systemd_home: add module parameter to control whether to accept only password or only pkcs11/fido2 auth, and then use this to hook nicely into two of the three PAM stacks gdm provides. See discussion at https://github.com/authselect/authselect/pull/311 * sd-boot: make boot loader spec type #1 accept http urls in "linux" lines. Then, do the uefi http dance to download kernels and boot them.
This is then useful for network boot, by embedding a cpio with type #1 snippets in sd-boot, which reference remote kernels. * maybe prohibit setuid() to the nobody user, to lock things down, via seccomp. nobody is not a user any code should run under, ever, as that user would possibly get a lot of access to resources it really shouldn't be getting access to due to the userns + nfs semantics of the user. Alternatively: use the seccomp log action, and allow it. * sd-boot: add a new PE section .bls or so that carries a cpio with additional boot loader entries (both type1 and type2). Then when initializing, find this section, iterate through it and populate menu with it. cpio is simple enough to make a parser for this reasonably robust. use same path structures as in the ESP. Similarly, add one for signature key drop-ins. * sd-boot: also allow passing in the cpio as in the previous item via SMBIOS * add a new EFI tool "sd-fetch" or so. It looks in a PE section ".url" for a URL, then downloads the file from it using UEFI HTTP APIs, and executes it. Use case: provide a minimal ESP with sd-boot and a couple of these sd-fetch binaries in place of UKIs, and download them on-the-fly. * maybe: systemd-loop-generator that sets up loopback devices if requested via kernel cmdline. use case: include encrypted/verity root fs in UKI. * systemd-gpt-auto-generator: add kernel cmdline option to override block device to dissect. also support dissecting a regular file. use case: include encrypted/verity root fs in UKI. * sd-stub: add ".bootcfg" section for kernel bootconfig data (as per https://docs.kernel.org/admin-guide/bootconfig.html) * tpm2: add (optional) support for generating a local signing key from PCR 15 state. use private key part to sign PCR 7+14 policies. stash signatures for expected PCR7+14 policies in EFI var. use public key part in disk encryption. generate new sigs whenever db/dbx/mok/mokx gets updated.
that way we can securely bind against SecureBoot/shim state, without having to re-enroll everything on each update (we still have to generate one sig on each update, but that should be robust/idempotent). needs rollback protection, as usual. * Lennart: big blog story about DDIs * Lennart: big blog story about building initrds * Lennart: big blog story about "why systemd-boot" * bpf: see if we can use BPF to solve the syslog message cgroup source problem: one idea would be to patch source sockaddr of all AF_UNIX/SOCK_DGRAM to implicitly contain the source cgroup id. Another idea would be to patch sendto()/connect()/sendmsg() sockaddr on-the-fly to use a different target sockaddr. * bpf: see if we can address opportunistic inode sharing of immutable fs images with BPF. i.e. if bpf gives us the power to hook into openat() and return a different inode than the one requested, but with the same contents, then we can use that to implement opportunistic inode sharing among DDIs: make all DDIs ship xattr on all reg files with a SHA256 hash. Then, also dictate that DDIs should come with a top-level subdir where all reg files are linked into by their SHA256 sum. Then, whenever an inode is opened with the xattr set, check bpf table to find dirs with hashes for other prior DDIs and try to use inode from there. * extend the verity signature partition to permit multiple signatures for the same root hash, so that people can sign a single image with multiple keys. * consider adding a new partition type, just for /opt/ for usage in system extensions * gpt-auto-discovery: also use the pkcs7 signature stuff, and pass signature to kernel. So far we only did this for the various --image= switches, but not for the root fs or /usr/. * dissection policy should enforce that unlocking can only take place by certain means, i.e. only via pw, only via tpm2, or only via fido, or a combination thereof.
* make the systemd-repart "seed" value provisionable via credentials, so that confidential computing environments can set it and deterministically enforce the uuids for partitions created, so that they can calculate PCR 15 ahead of time. * systemd-repart: also derive the volume key from the seed value, for the aforementioned purpose. * in the initrd: derive the default machine ID to pass to the host PID 1 via $machine_id from the same seed credential. * Add systemd-sysupdate-initrd.service or so that runs systemd-sysupdate in the initrd to bootstrap the initrd to populate the initial partitions. Some things to figure out: - Should it run on firstboot or on every boot? - If run on every boot, should it use the sysupdate config from the host on subsequent boots? * revisit default PCR bindings in cryptenroll and systemd-creds. Currently they use PCR 7 which should contain secureboot state db/dbx. Which sounded like a safe bet, given that it should change only on policy changes, and not software updates. But that's wrong. Recent fwupd (rightfully) contains code for updating the dbx denylist. This means even without any active policy change PCR 7 might change. Hence, better idea might be in systemd-creds to default to PCR 15 at least if sd-stub is used (i.e. bind to system identity), and in cryptsetup simply the empty list? Also, PCR 14 almost certainly should be included as much as PCR 7 (as it contains shim's policy, which is certainly as relevant as PCR 7 on many systems) * To mimic the new tpm2-measure-pcr= crypttab option add the same to veritytab (measuring the root hash) and integritytab (measuring the HMAC key if one is used) * We should start measuring all services, containers, and system extensions we activate. probably into PCR 13. i.e. add --tpm2-measure-pcr= or so to systemd-nspawn, and MeasurePCR= to unit files. Should contain a measurement of the activated configuration and the image that is being activated (in case verity is used, hash of the root hash). 
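Several items above talk about measuring things into PCRs (the proposed MeasurePCR= setting, measuring the activated configuration plus a hash of the verity root hash). The underlying TPM operation is an extend: the new PCR value is the hash of the old value concatenated with the digest of the measured data. A minimal software model of a SHA256 bank follows; the function names, the unit name, and the idea of measuring exactly those two values in that order are illustrative assumptions, not an existing systemd interface.

```python
import hashlib

PCR_SIZE = hashlib.sha256().digest_size  # 32 bytes for the SHA256 bank
PCR_INITIAL = bytes(PCR_SIZE)            # PCRs start out as all zeroes


def pcr_extend(pcr: bytes, data: bytes) -> bytes:
    """Model of a TPM extend: new = SHA256(old || SHA256(data))."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def measure_activation(pcr: bytes, unit: str, root_hash: bytes) -> bytes:
    """Hypothetical measurement of a unit name plus its image's root hash."""
    pcr = pcr_extend(pcr, unit.encode())
    return pcr_extend(pcr, root_hash)
```

Since each extend folds in the previous value, the final PCR depends on both the set and the order of measurements, so a verifier has to replay the same sequence to predict it.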
* bootspec: permit graceful "update" from type #2 to type #1. If both a type #1 and a type #2 entry exist under otherwise the exact same name, then use the type #1 entry, and ignore the type #2 entry. This way, people can "upgrade" from the UKI with all parameters baked in to a Type #1 .conf file with manual parametrization, if needed. This matches our usual rule that admin config should win over vendor defaults. * write a "search path" spec, that documents the prefixes to search in (i.e. the usual /etc/, /run/, /usr/lib/ dance, potentially /usr/etc/), how to sort found entries, how masking works and overriding. * automatic boot assessment: add one more default success check that just waits for a bit after boot, and blesses the boot if the system stayed up that long. * systemd-repart: add support for generating ISO9660 images * systemd-repart: in addition to the existing "factory reset" mode (which simply empties existing partitions marked for that). add a mode where partitions marked for it are entirely removed. Use case: remove secondary OS copy, and redundant partitions entirely, and recreate them anew. * systemd-boot: maybe add support for collapsing menu entries of the same OS into one item that can be opened (like in a "tree view" UI element) or collapsed. If only a single OS is installed, disable this mode, but if multiple OSes are installed might make sense to default to it, so that user is not immediately bombarded with a multitude of Linux kernel versions but only one for each OS. * systemd-repart: if the GPT *disk* UUID (i.e. the one global for the entire disk) is set to all FFFFF then use this as trigger for factory reset, in addition to the existing mechanisms via EFI variables and kernel command line. Benefit: works also on non-EFI systems, and can be requested on one boot, for the next. * systemd-sysupdate: make transport pluggable, so people can plug casync or similar behind it, instead of http. 
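The bootspec item above (a type #1 entry wins over a type #2 entry of otherwise the same name) boils down to a simple precedence rule. A rough sketch under assumed names: the Entry type and the notion that entries are compared by a bare name with the suffix already stripped are both hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str  # entry name with the .conf/.efi suffix stripped (assumption)
    kind: int  # 1 for a type #1 .conf entry, 2 for a type #2 UKI


def pick_entries(entries: list[Entry]) -> list[Entry]:
    """Keep one entry per name, preferring type #1 over type #2."""
    chosen: dict[str, Entry] = {}
    for entry in entries:
        current = chosen.get(entry.name)
        if current is None or entry.kind < current.kind:
            chosen[entry.name] = entry
    return sorted(chosen.values(), key=lambda e: e.name)
```

This mirrors the stated rationale: the admin-written .conf file overrides the vendor-built UKI of the same name, while UKIs without a matching .conf file remain in the menu.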
* systemd-tmpfiles: add concept for conditionalizing lines on factory reset boot, or on first boot. * we probably need .pcrpkeyrd or so as additional PE section in UKIs, which contains a separate public key for PCR values that only apply in the initrd, i.e. in the boot phase "enter-initrd". Then, consumers in userspace can easily bind resources to just the initrd. Similarly, maybe one more for "enter-initrd:leave-initrd" for resources that shall be accessible only before unprivileged user code is allowed. (we only need this for .pcrpkey, not for .pcrsig, since the latter is a list of signatures anyway). With that, when you enroll a LUKS volume or similar, pick either the .pcrpkey (for coverage through all phases of the boot, but excluding shutdown), the .pcrpkeyrd (for coverage in the initrd only) or .pcrpkeybt (for coverage until users are allowed to log in). * Once the root fs LUKS volume key is measured into PCR 15, default to binding credentials to PCR 15 in "systemd-creds" * add support for asymmetric LUKS2 TPM based encryption. i.e. allow preparing an encrypted image on some host given a public key belonging to a specific other host, so that only hosts possessing the private key in the TPM2 chip can decrypt the volume key and activate the volume. Use case: systemd-confext for a central orchestrator to generate confext images securely that can only be activated on one specific host (which can be used for installing a bunch of creds in /etc/credstore/ for example). Extending on this: allow binding LUKS2 TPM based encryption also to the TPM2 internal clock. Net result: prepare a confext image that can only be activated on a specific host that runs specific software in a specific time window. confext would be automatically invalidated outside of it. * maybe add a "systemd-report" tool, that generates a TPM2-backed "report" of current system state, i.e.
a combination of PCR information, local system time and TPM clock, running services, recent high-priority log messages/coredumps, system load/PSI, signed by the local TPM chip, to form an enhanced remote attestation quote. Use case: a simple orchestrator could use this: have the report tool upload these reports every 3min somewhere. Then have the orchestrator collect these reports centrally over a 3min time window, and use them to determine which node should now start/stop what, and generate a small confext for each node, that uses Uphold= to pin services on each node. The confext would be encrypted using the asymmetric encryption proposed above, so that it can only be activated on the specific host, if the software is in a good state, and within a specific time frame. Then run a loop on each node that sends report to orchestrator and then sysupdate to update confext. Orchestrator would be stateless, i.e. operate on desired config and collected reports in the last 3min time window only, and thus can be trivially scaled up since all instances of the orchestrator should come to the same conclusions given the same inputs of reports/desired workload info. Could also be used to deliver Wireguard secrets and such to clients, thus permitting zero-trust networking: secrets are rolled over via confext updates, and via the time window TPM logic invalidated if node doesn't keep itself updated, or becomes corrupted in some way. * in the initrd, once the rootfs encryption key has been measured to PCR 15, derive default machine ID to use from it, and pass it to host PID 1. * sd-boot: for each installed OS, grey out older entries (i.e. all but the newest), to indicate they are obsolete * automatically propagate LUKS password credential into cryptsetup from host (i.e. SMBIOS type #11, …), so that one can unlock LUKS via VM hypervisor supplied password.
* add ability to path_is_valid() to classify paths that refer to a dir from those which may refer to anything, and use that in various places to filter early. i.e. stuff ending in "/", "/." and "/.." definitely refers to a directory, and paths ending that way can be refused early in many contexts. * systemd-measure: add --pcrpkey-auto as an alternative to --pcrpkey=, where it would just use the same public key specified with --public-key= (or the one automatically derived from --private-key=). * Add "purpose" flag to partition flags in discoverable partition spec that indicate if partition is intended for sysext, for portable service, for booting and so on. Then, when dissecting DDI allow specifying a purpose to use as additional search condition. Use case: images that combined a sysext partition with a portable service partition in one. * On boot, auto-generate an asymmetric key pair from the TPM, and use it for validating DDIs and credentials. Maybe upload it to the kernel keyring, so that the kernel does this validation for us for verity and kernel modules * lock down acceptable encrypted credentials at boot, via simple allowlist, maybe on kernel command line: systemd.import_encrypted_creds=foobar.waldo,tmpfiles.extra to protect locked down kernels from credentials generated on the host with a weak kernel * Merge systemd-creds options --uid= (which accepts user names) and --user. * Add support for extra verity configuration options to systemd-repart (FEC, hash type, etc) * chase(): take inspiration from path_extract_filename() and return O_DIRECTORY if input path contains trailing slash. * chase(): refuse resolution if trailing slash is specified on input, but final node is not a directory * document in boot loader spec that symlinks in XBOOTLDR/ESP are not OK even if non-VFAT fs is used. * measure credentials picked up from SMBIOS to some suitable PCR * measure GPT and LUKS headers somewhere when we use them (i.e. 
in systemd-gpt-auto-generator/systemd-repart and in systemd-cryptsetup?) * pick up creds from EFI vars * Add and pickup tpm2 metadata for creds structure. * sd-boot: we probably should include all BootXY EFI variable defined boot entries in our menu, and then suppress ourselves. Benefit: instant compatibility with all other OSes which register things there, in particular on other disks. Always boot into them via NextBoot EFI variable, to not affect PCR values. * systemd-measure tool: - pre-calculate PCR 12 (command line) + PCR 13 (sysext) the same way we can precalculate PCR 11 * in sd-boot: load EFI drivers from a new PE section. That way, one can have a "supercharged" sd-boot binary, that could carry ext4 drivers built-in. * sd-device: add an API for acquiring list of child devices, given a device object (i.e. all child dirents that are dirs or symlinks to dirs) * sd-device: maybe pin the sysfs dir with an fd, during the entire runtime of an sd_device, then always work based on that. * maybe add new flags to gpt partition tables for rootfs and usrfs indicating purpose, i.e. whether something is supposed to be bootable in a VM, on baremetal, on an nspawn-style container, if it is a portable service image, or a sysext for initrd, for host os, or for portable container. Then hook portabled/… up to udev to watch block devices coming up with the flags set, and use it. * sd-boot should look for information what to boot in SMBIOS, too, so that VM managers can tell sd-boot what to boot into and suchlike * add "systemd-sysext identify" verb, that you can point at any file in /usr/ and that determines from which overlayfs layer it originates, which image, and with what it was signed. * systemd-creds: extend encryption logic to support asymmetric encryption/authentication. Idea: add new verb "systemd-creds public-key" which generates a priv/pub key pair on the TPM2 and stores the priv key locally in /var. It then outputs a certificate for the pub part to stdout.
This can then be copied/taken elsewhere, and can be used for encrypting creds that only the host on its specific hw can decrypt. Then, support a drop-in dir with certificates that can be used to authenticate credentials. Flow of operations is then this: build image with owner certificate, then after boot up issue "systemd-creds public-key" to acquire pubkey of the machine. Then, when passing data to the machine, sign with privkey belonging to one of the dropped in certs and encrypt with machine pubkey, and pass to machine. Machine is then able to authenticate you, and confidentiality is guaranteed. * building on top of the above, the pub/priv key pair generated on the TPM2 should probably also be one you can use to get a remote attestation quote. * Process credentials in: • crypttab-generator: allow defining additional crypttab-like volumes via credentials (similar: verity-generator, integrity-generator). Use fstab-generator logic as inspiration. • run-generator: allow defining additional commands to run via a credential • resolved: allow defining additional /etc/hosts entries via a credential (it might make sense to then synthesize a new combined /etc/hosts file in /run and bind mount it on /etc/hosts for other clients that want to read it). • repart: allow defining additional partitions via credential • timesyncd: pick NTP server info from credential • portabled: read a credential "portable.extra" or so, that takes a list of file system paths to enable on start. • make systemd-fstab-generator look for a system credential encoding root= or usr= • in gpt-auto-generator: check partition uuids against such uuids supplied via sd-stub credentials. That way, we can support parallel OS installations with pre-built kernels. * define a JSON format for units, separating out unit definitions from unit runtime state. Then, expose it: 1. Add Describe() method to Unit D-Bus object that returns a JSON object about the unit. 2. Expose this natively via Varlink, in similar style 3.
Use it when invoking binaries (i.e. make PID 1 fork off systemd-executor binary which reads the JSON definition and runs it), to address the cow trap issue and the fact that NSS is actually forbidden in forked-but-not-exec'ed children 4. Add varlink API to run transient units based on provided JSON definitions * Add SUPPORT_END_URL= field to os-release with more *actionable* information what to do if support ended * pam_systemd: on interactive logins, maybe show SUPPORT_END information at login time, à la motd * sd-boot: instead of unconditionally deriving the ESP to search boot loader spec entries in from the paths of sd-boot binary, let's optionally allow it to be configured on sd-boot cmdline + efi var. Use case: embed sd-boot in the UEFI firmware (for example, ovmf supports that via qemu cmdline option), and use it to load stuff from the ESP. * mount /var/ from initrd, so that we can apply sysext and stuff before the initrd transition. Specifically: 1. There should be a var= kernel cmdline option, matching root= and usr= 2. systemd-gpt-auto-generator should auto-mount /var if it finds it on disk 3. mount.x-initrd mount option in fstab should be implied for /var * make persistent restarts easier by adding a new setting OpenPersistentFile= or so, which allows opening one or more files that are "persistent" across service restarts, hot reboot, cold reboots (depending on configuration): the files are created empty on first invocation, and on subsequent invocations the files are reopened. The files would be backed by tmpfs, pmem or /var depending on desired level of persistency. * sd-event: add ability to "chain" event sources. Specifically, add a call sd_event_source_chain(x, y), which will automatically enable event source y in oneshot mode once x is triggered.
Use case: in src/core/mount.c implement the /proc/self/mountinfo rescan on SIGCHLD with this: whenever a SIGCHLD is seen, trigger the rescan defer event source automatically, and allow it to be dispatched *before* the SIGCHLD is handled (based on priorities). Benefit: dispatch order is strictly controlled by priorities again. (next step: chain event sources to the ratelimit being over) * if we fork off a service with StandardOutput=journal, and it forks off a subprocess that quickly dies, we might not be able to identify the cgroup it comes from, but we can still derive that from the stdin socket its output came from. We apparently don't do that right now. * add ability to set hostname with suffix derived from machine id at boot * add PR_SET_DUMPABLE service setting * homed/userdb: maybe define a "companion" dir for home directories where apps can safely put privileged stuff in. Would not be writable by the user, but still conceptually belong to the user. Would be included in user's quota if possible, even if files are not owned by UID of user. Use case: container images that are owned by arbitrary UIDs, and are owned/managed by the users, but are not directly belonging to the user's UID. Goal: we shouldn't place more privileged dirs inside of unprivileged dirs, and thus containers really should not be placed inside of traditional UNIX home dirs (which are owned by users themselves) but somewhere else, that is separate, but still close by. Inform user code about path to this companion dir via env var, so that container managers find it. the ~/.identity file is also a candidate for a file to move there, since it is managed by privileged code (i.e. homed) and not unprivileged code. * maybe add support for binding and connecting AF_UNIX sockets in the file system outside of the 108ch limit. When connecting, open O_PATH fd to socket inode first, then connect to /proc/self/fd/XYZ. When binding, create symlink to target dir in /tmp, and bind through it.
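The AF_UNIX item above describes two tricks for getting around the 108 character sun_path limit. A minimal Linux-only sketch of both follows, assuming /proc is mounted and that the temporary directory path is short; the helper names bind_long() and connect_long() are made up for illustration.

```python
import os
import socket
import tempfile


def bind_long(path: str) -> socket.socket:
    """Bind a listening socket at `path` via a short symlink to its dir."""
    directory, name = os.path.split(path)
    scratch = tempfile.mkdtemp()        # short path, typically below /tmp
    link = os.path.join(scratch, "d")
    os.symlink(directory, link)         # `directory` should be absolute
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(os.path.join(link, name))
        sock.listen(1)
        return sock
    except OSError:
        sock.close()
        raise
    finally:
        # The symlink is only needed while bind() resolves the path.
        os.unlink(link)
        os.rmdir(scratch)


def connect_long(path: str) -> socket.socket:
    """Connect to the socket at `path` through an O_PATH fd."""
    fd = os.open(path, os.O_PATH | os.O_CLOEXEC)
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(f"/proc/self/fd/{fd}")
        return sock
    except OSError:
        sock.close()
        raise
    finally:
        os.close(fd)
```

In both cases the kernel only ever sees a short path, while the socket inode itself lives at the long one. The name component alone still has to fit within the limit once it is appended to the symlink path.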
* add a proper concept of a "developer" mode, i.e. where cryptographic protections of the root OS are weakened after interactive confirmation, to allow hackers to allow their own stuff. idea: allow entering developer mode only via explicit choice in boot menu: i.e. add explicit boot menu item for it. When developer mode is entered, generate a key pair in the TPM2, and add the public part of it automatically to keychain of valid code signature keys on subsequent boots. Then provide a tool to sign code with the key in the TPM2. Ensure that boot menu item is the only way to enter developer mode, by binding it to locality/PCRs so that keys cannot be generated otherwise. * services: add support for cryptographically unlocking per-service directories via TPM2. Specifically, for StateDirectory= (and related dirs) use fscrypt to set up the directory so that it can only be accessed if host and app are in order. * update HACKING.md to suggest developing systemd with the ideas from: https://0pointer.net/blog/testing-my-system-code-in-usr-without-modifying-usr.html https://0pointer.net/blog/running-an-container-off-the-host-usr.html * sd-event: compat wd reuse in inotify code: keep a set of removed watch descriptors, and clear this set piecemeal when we see the IN_IGNORED event for it, or when read() returns EAGAIN or on IN_Q_OVERFLOW. Then, whenever we see an inotify wd event check against this set, and if it is contained ignore the event. (to be fully correct this would have to count the occurrences, in case the same wd is reused multiple times before we start processing IN_IGNORED again) * for vendor-built signed initrds: - kernel-install should be able to install encrypted creds automatically for machine id, root pw, rootfs uuid, resume partition uuid, and place next to EFI kernel, for sd-stub to pick them up. These creds should be locked to the TPM, and bind to the right PCR the kernel is measured to. 
- kernel-install should be able to pick up initrd sysexts automatically and place them next to EFI kernel, for sd-stub to pick them up. - systemd-fstab-generator should look for rootfs device to mount in creds - systemd-resume-generator should look for resume partition uuid in creds - sd-stub: automatically pick up microcode from ESP (/loader/microcode/*) and synthesize initrd from it, and measure it. Signing is not necessary, as microcode does that on its own. Pass as first initrd to kernel. * Maybe extend the service protocol to support handling of some specific SIGRT signal for setting service log level, that carries the level via the sigqueue() data parameter. Enable this via unit file setting. * sd_notify/vsock: maybe support binding to AF_VSOCK in Type=notify services, then passing $NOTIFY_SOCKET and $NOTIFY_GUESTCID with PID1's cid (typically fixed to "2", i.e. the official host cid) and the expected guest cid, for the two sides of the channel. The latter env var could then be used in an appropriate qemu cmdline. That way qemu payloads could talk sd_notify() directly to host service manager. * sd-device should return the devnum type (i.e. 'b' or 'c') via some API for an sd_device object, so that data passed into sd_device_new_from_devnum() can also be queried. * sd-event: optionally, if per-event source rate limit is hit, downgrade priority, but leave enabled, and once ratelimit window is over, upgrade priority again. That way we can combat event source starvation without stopping processing events from one source entirely. * sd-event: similar to existing inotify support add fanotify support (given that apparently new features in this area are only going to be added to the latter). 
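The sd-event rate limit item above (downgrade the priority of a source that hit its rate limit instead of disabling it, then restore the priority when the window is over) can be modelled in a few lines. This is a conceptual model only and not the sd-event API: the class, the penalty value and the explicit `now` argument are assumptions. It follows the sd-event convention that numerically smaller priorities are dispatched first.

```python
class Source:
    """Toy event source with a per-source rate limit window."""

    def __init__(self, name, priority, burst, interval, penalty=100):
        self.name = name
        self.priority = priority  # smaller value means dispatched earlier
        self.burst = burst        # dispatches allowed per window
        self.interval = interval  # window length, in arbitrary time units
        self.penalty = penalty    # added to priority while over the limit
        self.window_start = 0
        self.count = 0

    def _expire(self, now):
        # Start a fresh window once the current one has run out.
        if now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0

    def dispatched(self, now):
        self._expire(now)
        self.count += 1

    def effective_priority(self, now):
        self._expire(now)
        over_limit = self.count >= self.burst
        return self.priority + (self.penalty if over_limit else 0)


def next_source(sources, now):
    """Pick the pending source with the best effective priority."""
    return min(sources, key=lambda s: s.effective_priority(now))
```

A noisy high-priority source thus falls behind quieter ones once it exhausts its burst, but it still gets dispatched when nothing else is pending, and it regains its original priority when the window expires.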
* sd-event: add 1st class event source for clock changes * sd-event: add 1st class event source for timezone changes * support uefi/http boots with sd-boot: instead of looking for dropin files in /loader/entries/ dir, look for a file /loader/entries/SHA256SUMS and use that as directory manifest. The file would be a standard directory listing as generated by GNU sha256sums. * sd-boot: maybe add support for embedding the various auxiliary resources we look for right in the sd-boot binary. i.e. take inspiration from sd-stub logic: allow combining sd-boot via ukify with kernels to enumerate, .conf files, drivers, keys to enroll and so on. Then, add whatever we find that way to the menu. Use case: allow building a single PE image you can boot into via UEFI HTTP boot. * maybe add a new UEFI stub binary "sd-http". It works similar to sd-stub, but all it does is download a file from a http server, and execute it, after optionally checking its hash sum. idea would be: combine this "sd-http" stub binary with some minimal info about a URL + hash sum, plus .osrel data, and drop it into the unified kernel dir in the ESP. And bam you have something that is tiny, feels a lot like a unified kernel, but all it does is chainload the real kernel. benefit: downloading these stubs would be tiny and quick, hence cheap for enumeration. * sysext: measure all activated sysext into a TPM PCR * systemd-dissect: show available versions inside of a disk image, i.e. if multiple versions are around of the same resource, show which ones. (in other words: show partition labels). * systemd-dissect: add --cat switch for dumping files such as /etc/os-release * per-service sandboxing option: ProtectIds=. If used, will overmount /etc/machine-id and /proc/sys/kernel/random/boot_id with synthetic files, to make it harder for the service to identify the host. Depending on the user setting it should be fully randomized at invocation time, or a hash of the real thing, keyed by the unit name or so. 
Of course, there are other ways to get these IDs (e.g. journal) or similar ids (e.g. MAC addresses, DMI ids, CPU ids), so this knob would only be useful in combination with other lockdown options. Particularly useful for portable services, and anything else that uses RootDirectory= or RootImage=. (Might also over-mount /sys/class/dmi/id/*{uuid,serial} with /dev/null). * doc: prep a document explaining resolved's internal objects, i.e. Query vs. Question vs. Transaction vs. Stream and so on. * doc: prep a document explaining PID 1's internal logic, i.e. transactions, jobs, units * automatically ignore threaded cgroups in cg_xyz(). * add linker script that implicitly adds symbol for build ID and new coredump json package metadata, and use that when logging * Enable RestrictFileSystems= for all our long-running services (similar: RestrictNetworkInterfaces=) * Add systemd-analyze security checks for RestrictFileSystems= and RestrictNetworkInterfaces= * cryptsetup/homed: implement TOTP authentication backed by TPM2 and its internal clock. * man: rework os-release(5), and clearly separate our extension-release.d/ and initrd-release parts, i.e. list explicitly which fields are about what. * sysext: before applying a sysext, do a superficial validation run so that things are not rearranged too wildly. I.e. protect against accidental fuckups, such as masking out /usr/lib/ or so. We should probably refuse if existing inodes are replaced by other types of inodes or so. * userdb: when synthesizing NSS records, pick "best" password from defined passwords, not just the first. i.e. if there are multiple defined, prefer unlocked over locked and prefer non-empty over empty. * homed: if the homed shell fallback thing has access to an SSH agent, try to use it to unlock home dir (if ssh-agent forwarding is enabled).
We could implement SSH unlocking of a homedir with that: when enrolling a new ssh pubkey in a user record we'd ask the ssh-agent to sign some random value with the privkey, then use that as luks key to unlock the home dir. Will not work for ECDSA keys since their signatures contain a random component, but will work for RSA and Ed25519 keys. * add tiny service that decrypts encrypted user records passed via initrd credential logic and drops them into /run where nss-systemd can pick them up, similar to /run/host/userdb/. Use case: drop a root user JSON record there, and use it in the initrd to log in as root with locally selected password, for debugging purposes. Other use case: boot into qemu with regular user mounted from host. maybe put this in systemd-user-sessions.service? * drop dependency on libcap, replace by direct syscalls based on CapabilityQuintet we already have. (This likely allows us to drop libcap dep in the base OS image) * userdbd: implement an additional varlink service socket that provides the host user db in restricted form, then allow this to be bind mounted into sandboxed environments that want the host database in minimal form. All records would be stripped of all meta info, except the basic UID/name info. Then use this in portabled environments that do not use PrivateUsers=1. * portabled: when extracting unit files and copying to system.attached, if a .p7s is available in the image, use it to protect the system.attached copy with fs-verity, so that it cannot be tampered with * /etc/veritytab: allow that the roothash column can be specified as fs path including a path to an AF_UNIX path, similar to how we do things with the keys of /etc/crypttab. That way people can store/provide the roothash externally and provide to us on demand only. * we probably should extend the root verity hash of the root fs into some PCR on boot. (i.e. 
maybe add a veritytab option tpm2-measure=12 or so to measure it into PCR 12); Similar: we probably should extend the LUKS volume key of the root fs into some PCR on boot. (i.e. maybe add a crypttab option tpm2-measure=15 or so to measure it into PCR 15); once both are in place update gpt-auto-discovery to generate these by default for the partitions it discovers. Static vendor stuff should probably end up in PCR 12 (i.e. the verity hash), with local keys in PCR 15 (i.e. the encryption volume key). That way, we nicely distinguish resources supplied by the OS vendor (i.e. sysext, root verity) from those inherently local (i.e. encryption key), which is useful if they shall be signed separately. * in uefi stub: query firmware regarding which PCR banks are being used, store that in EFI var. then use this when enrolling TPM2 in cryptsetup to verify that the selected PCRs actually are used by firmware. * rework recursive read-only remount to use new mount API * PAM: pick up authentication token from credentials * when mounting disk images: if IMAGE_ID/IMAGE_VERSION is set in os-release data in the image, make sure the image filename actually matches this, so that images cannot be misused. * New udev block device symlink names: /dev/disk/by-parttypelabel/-. Use case: if pt label is used as partition image version string, this is a safe way to reference a specific version of a specific partition type, in particular where related partitions are processed (e.g. verity + rootfs both named "LennartOS_0.7"). * sysupdate: - add fuzzing to the pattern parser - support casync as download mechanism - "systemd-sysupdate update --all" support, that iterates through all components defined on the host, plus all images installed into /var/lib/machines/, /var/lib/portable/ and so on. - Allow invocation with a single transfer definition, i.e. with --definitions= pointing to a file rather than a dir. - add ability to disable implicit decompression of downloaded artifacts, i.e. 
a Compress=no option in the transfer definitions * in sd-id128: also parse UUIDs in RFC4122 URN syntax (i.e. chop off urn:uuid: prefix) * systemd-sysext: optionally, run it in initrd already, before transitioning into host, to open up possibility for services shipped like that. * introduce /dev/disk/root/* symlinks that allow referencing partitions on the disk the rootfs is on in a reasonably secure way. (or maybe: add /dev/gpt-auto-{home,srv,boot,…} similar in style to /dev/gpt-auto-root as we already have it.) * whenever we receive fds via SCM_RIGHTS make sure none got dropped due to the reception limit the kernel silently enforces. * Add service unit setting ConnectStream= which takes IP addresses and connects to them. * Similar, Load= which takes literal data in text or base64 format, and puts it into a memfd, and passes that. This enables some fun stuff, such as embedding bash scripts in unit files, by combining Load= with ExecStart=/bin/bash /proc/self/fd/3 * add a ConnectSocket= setting to service unit files, that may reference a socket unit, and which will connect to the socket defined therein, and pass the resulting fd to the service program via socket activation proto. * Add a concept of ListenStream=anonymous to socket units: listen on a socket that is deleted in the fs. Use case would be with ConnectSocket= above. * importd: support image signature verification with PKCS#7 + OpenBSD signify logic, as alternative to crummy gpg * add "systemd-analyze debug" + AttachDebugger= in unit files: The former specifies a command to execute; the latter specifies that an already running "systemd-analyze debug" instance shall be contacted and execution paused until it gives an OK. That way, tools like gdb or strace can safely be invoked on processes forked off PID 1. * expose MS_NOSYMFOLLOW in various places * credentials system: - acquire from EFI variable? - acquire via ask-password? - acquire creds via keyring? - pass creds via keyring? - pass creds via memfd?
- acquire + decrypt creds from pkcs11? - make PAMName= acquire pw via creds logic - make macsec code in networkd read key via creds logic (copy logic from wireguard) - make gatewayd/remote read key via creds logic - add sd_notify() command for flushing out creds not needed anymore * TPM2: auto-reenroll in cryptsetup, as fallback for hosed firmware upgrades and such * introduce a new group to own TPM devices * cryptsetup: add option for automatically removing empty password slot on boot * cryptsetup: optionally, when run during boot-up and password is never entered, and we are on battery power (or so), power off machine again * cryptsetup: when waiting for FIDO2/PKCS#11 token, tell plymouth that, and allow plymouth to abort the waiting and enter pw instead * make cryptsetup lower --iter-time * cryptsetup: allow encoding key directly in /etc/crypttab, maybe with a "base64:" prefix. Useful in particular for pkcs11 mode. * cryptsetup: reimplement the mkswap/mke2fs in cryptsetup-generator to use systemd-makefs.service instead. * cryptsetup: - cryptsetup-generator: allow specification of passwords in crypttab itself - support rd.luks.allow-discards= kernel cmdline params in cryptsetup generator * systemd-analyze netif that explains predictable interface (or networkctl) * systemd-analyze inspect-elf should show other notes too, at least build-id. * Figure out naming of verbs in systemd-analyze: we have (singular) capability, exit-status, but (plural) filesystems, architectures. * Add service setting to run a service within the specified VRF. i.e. do the equivalent of "ip vrf exec". * special case some calls of chase() to use openat2() internally, so that the kernel does what we otherwise do. * add a new flag to chase() that stops chasing once the first missing component is found and then allows the caller to create the rest. * make use of new glibc 2.32 APIs sigabbrev_np() and strerrorname_np(). 
* if /usr/bin/swapoff fails due to OOM, log a friendly explanatory message about it * pid1: also remove PID files of a service when the service starts, not just when it exits * make us use dynamically fewer deps for containers in general purpose distros: o turn into dlopen() deps: - libblkid (only in RootImage= handling in PID 1, but not elsewhere) - libpam (only when called from PID 1) * seccomp: maybe use seccomp_merge() to merge our filters per-arch if we can. Apparently kernel performance is much better with fewer larger seccomp filters than with more smaller seccomp filters. * systemd-path: Add "private" runtime/state/cache dir enum, mapping to $RUNTIME_DIRECTORY, $STATE_DIRECTORY and such * seccomp: by default mask x32 ABI system wide on x86-64. it's on its way out * seccomp: don't install filters for ABIs that are masked anyway for the specific service * busctl: maybe expose a verb "ping" for pinging a dbus service to see if it exists and responds. * socket units: allow creating a udev monitor socket with ListenDevices= or so, with matches, then activate app through that passing socket over * unify on openssl: - kill gnutls support in resolved - figure out what to do about libmicrohttpd, which has a hard dependency on gnutls - port fsprg over to a dlopen lib, then switch it to openssl * add growvol and makevol options for /etc/crypttab, similar to x-systemd.growfs and x-systemd-makefs. * userdb: allow username prefix searches in varlink API, allow realname and realname substr searches in varlink API * userdb: allow uid/gid range checks * userdb: allow existence checks * pid1: activation by journal search expression * when switching root from initrd to host, set the machine_id env var so that if the host has no machine ID set yet we continue to use the random one the initrd had set. * sd-event: add native support for P_ALL waitid() watching, then move PID 1 to it for reaping assigned but unknown children. 
This needs some special care to operate somewhat sensibly in light of priorities: P_ALL will return arbitrary processes, regardless of the priority we want to watch them with, hence on each event loop iteration check all processes which we shall watch with higher prio explicitly, and then watch the entire rest with P_ALL. * tweak sd-event's child watching: keep a prioq of children to watch and use waitid() only on the children with the highest priority until one is waitable and ignore all lower-prio ones from that point on * maybe introduce xattrs that can be set on the root dir of the root fs partition that declare the volatility mode to use the image in. Previously I thought about marking this via GPT partition flags but that's not ideal since that's outside of the LUKS encryption/verity verification, and we probably shouldn't operate in a volatile mode unless we got told so from a trusted source. * coredump: maybe when coredumping read a new xattr from /proc/$PID/exe that may be used to mark a whole binary as non-coredumpable. Would fix: https://bugs.freedesktop.org/show_bug.cgi?id=69447 * teach parse_timestamp() timezones like the calendar spec already knows it * We should probably replace /etc/rc.d/README with a symlink to doc content. After all it is constant vendor data. * maybe add kernel cmdline params: to force random seed crediting * let's not GC a unit while its ratelimits are still pending * when killing due to service watchdog timeout maybe detect whether target process is under ptracing and then log loudly and continue instead. * make rfkill uaccess controllable by default, i.e. steal rule from gnome-bluetooth and friends * make MAINPID= message reception checks even stricter: if service uses User=, then check sending UID and ignore message if it doesn't match the user or root. * maybe trigger a uevent "change" on a device if "systemctl reload xyz.device" is issued.
* when importing an fs tree with machined, optionally apply userns-rec-chown * when importing an fs tree with machined, complain if image is not an OS * Maybe introduce a helper safe_exec() or so, which is to execve() what safe_fork() is to fork(). And then make it revert the RLIMIT_NOFILE soft limit to 1K implicitly, unless explicitly opted-out. * rework seccomp/nnp logic so that even if User= is used in combination with a seccomp option we don't have to set NNP. For that, change uid first while keeping CAP_SYS_ADMIN, then apply seccomp, then drop the cap. * when no locale is configured, default to UEFI's PlatformLang variable * add a new syscall group "@esoteric" for more esoteric stuff such as bpf() and userfaultfd() and make systemd-analyze check for it. * paranoia: whenever we process passwords, call mlock() on the memory first. i.e. look for all places we use free_and_erasep() and augment them with mlock(). Also use MADV_DONTDUMP. Alternatively (preferably?) use memfd_secret(). * Move RestrictAddressFamilies= to the new cgroup create socket * optionally: turn on cgroup delegation for per-session scope units * sd-boot: optionally, show boot menu when previous default boot item has non-zero "tries done" count * augment CODE_FILE=, CODE_LINE= with something like CODE_BASE= or so which contains some identifier for the project, which allows us to include clickable links to source files generating these log messages. The identifier could be some abbreviated URL prefix or so (taking inspiration from Go imports). For example, for systemd we could use CODE_BASE=github.com/systemd/systemd/blob/98b0b1123cc or so which is sufficient to build a link by prefixing "http://" and suffixing the CODE_FILE. * Augment MESSAGE_ID with MESSAGE_BASE, in a similar fashion so that we can make clickable links from log messages carrying a MESSAGE_ID, that lead to some explanatory text online.
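The CODE_BASE= idea above boils down to plain string concatenation. A minimal sketch, assuming the two fields are joined with a single "/" (the item does not spell out the separator); code_link() and the CODE_FILE value used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Builds "http://" + code_base + "/" + code_file into buf.
 * Returns 0 on success, -1 if a field is empty or the result does not fit. */
int code_link(char *buf, size_t size, const char *code_base, const char *code_file) {
        assert(buf);
        assert(code_base);
        assert(code_file);

        if (strlen(code_base) == 0 || strlen(code_file) == 0)
                return -1;

        /* Tolerate a CODE_FILE that already starts with a slash. */
        if (code_file[0] == '/')
                code_file++;

        int n = snprintf(buf, size, "http://%s/%s", code_base, code_file);
        if (n < 0 || (size_t) n >= size)
                return -1; /* output was truncated */

        return 0;
}
```

With CODE_BASE=github.com/systemd/systemd/blob/98b0b1123cc and a made-up CODE_FILE=src/core/main.c this yields http://github.com/systemd/systemd/blob/98b0b1123cc/src/core/main.c.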
* maybe extend .path units to expose fanotify() per-mount change events * hibernate/s2h: check if swap is on weird storage and refuse if so * cgroups: use inotify to get notified when somebody else modifies cgroups owned by us, then log a friendly warning. * beef up log.c with support for stripping ANSI sequences from strings, so that it is OK to include them in log strings. This would be particularly useful so that our log messages could contain clickable links for example for unit files and suchlike we operate on. * add support for "portablectl attach http://foobar.com/waaa.raw" (i.e. importd integration) * sync dynamic uids/gids between host+portable service (i.e. if DynamicUser=1 is set for a service, make sure that the selected user is resolvable in the service even if it ships its own /etc/passwd) * Fix DECIMAL_STR_MAX or DECIMAL_STR_WIDTH. One includes a trailing NUL, the other doesn't. What a disaster. Probably we should exclude it. * Check that users of inotify's IN_DELETE_SELF flag are using it properly, as usually IN_ATTRIB is the right way to watch deleted files, as the former only fires when a file is actually removed from disk, i.e. the link count drops to zero and is not open anymore, while the latter happens when a file is unlinked from any dir. * systemctl, machinectl, loginctl: port "status" commands over to format-table.c's vertical output logic. * pid1: lock image configured with RootDirectory=/RootImage= using the usual nspawn semantics while the unit is up * add --vacuum-xyz options to coredumpctl, matching those journalctl already has. * add CopyFile= or so as unit file setting that may be used to copy files or directory trees from the host to the service's RootImage= and RootDirectory= environment. Which we can use for /etc/machine-id and in particular /etc/resolv.conf. Should be smart and do something useful on read-only images, for example fall back to read-only bind mounting the file instead.
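For the log.c item above about stripping ANSI sequences, a rough sketch of an in-place filter. It only knows CSI sequences (ESC [ ... final byte) and OSC sequences (ESC ] ... terminated by BEL or ESC \, the form terminals use for clickable links); other escape types pass through untouched. strip_ansi() is a hypothetical name, not an existing helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Strips CSI and OSC escape sequences from s in place and returns s. */
char *strip_ansi(char *s) {
        assert(s);

        char *out = s;
        const char *p = s;

        while (*p) {
                if (p[0] == '\x1b' && p[1] == '[') {
                        /* CSI: parameter/intermediate bytes 0x20..0x3F, then one final byte 0x40..0x7E */
                        p += 2;
                        while (*p >= 0x20 && *p <= 0x3f)
                                p++;
                        if (*p >= 0x40 && *p <= 0x7e)
                                p++;
                        continue;
                }

                if (p[0] == '\x1b' && p[1] == ']') {
                        /* OSC: runs until BEL or the two-byte string terminator ESC \ */
                        p += 2;
                        while (*p && *p != '\a' && !(p[0] == '\x1b' && p[1] == '\\'))
                                p++;
                        if (*p == '\a')
                                p++;
                        else if (*p == '\x1b')
                                p += 2;
                        continue;
                }

                *out++ = *p++; /* ordinary byte, keep it */
        }

        *out = '\0';
        return s;
}
```

The filter never grows the string, so rewriting in place is safe; a real implementation would likely want to preserve the link target somewhere rather than discard it.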
* bypass SIGTERM state in unit files if KillSignal is SIGKILL * add proper dbus APIs for the various sd_notify() commands, such as MAINPID=1 and so on, which would mean we could report errors and such. * introduce DefaultSlice= or so in system.conf that allows changing where we place our units by default, i.e. change system.slice to something else. Similar, ManagerSlice= should exist so that PID1's own scope unit could be moved somewhere else too. Finally machined and logind should get similar options so that it is possible to move user session scopes and machines to a different slice too by default. Use case: people who want to put resources on the entire system, with the exception of one specific service. See: https://lists.freedesktop.org/archives/systemd-devel/2018-February/040369.html * calendarspec: add support for week numbers and day numbers within a year. This would allow us to define "bi-weekly" triggers safely. * sd-bus: add vtable flag, that may be used to request client creds implicitly and asynchronously before dispatching the operation * sd-bus: parse addresses given in sd_bus_set_addresses immediately and not only when used. Add unit tests. * make use of ethtool veth peer info in machined, for automatically finding out host-side interface pointing to the container. * add some special mode to LogsDirectory=/StateDirectory=… that allows declaring these directories without necessarily pulling in deps for them, or creating them when starting up. That way, we could declare that systemd-journald writes to /var/log/journal, which could be useful when we are doing disk usage calculations and so on.
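For the calendarspec week-number item above, a small sketch of how an ISO 8601 week number can be derived with standard C only, and how an "every second week" test might look. Treating "bi-weekly" as "even ISO week" is an assumption of this sketch, and both function names are made up.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Returns the ISO 8601 week number (1..53) of a calendar date, or -1 on error. */
int iso_week(int year, int month, int day) {
        assert(month >= 1 && month <= 12);
        assert(day >= 1 && day <= 31);

        struct tm tm = {0};
        tm.tm_year = year - 1900;
        tm.tm_mon = month - 1;
        tm.tm_mday = day;
        tm.tm_hour = 12;   /* noon, so a DST shift cannot move us to another day */
        tm.tm_isdst = -1;  /* let mktime() figure out DST */

        /* mktime() fills in tm_wday and tm_yday, which %V needs. */
        if (mktime(&tm) == (time_t) -1)
                return -1;

        char buf[4];
        if (strftime(buf, sizeof buf, "%V", &tm) == 0)
                return -1;

        return atoi(buf);
}

/* Returns 1 if the date falls into an even-numbered ISO week, 0 otherwise. */
int is_even_iso_week(int year, int month, int day) {
        int w = iso_week(year, month, day);

        return w > 0 && w % 2 == 0;
}
```

Note that week parity alone is not a perfectly regular two-week rhythm: in years with 53 ISO weeks, week 53 is followed by week 1, so two odd weeks end up back to back.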
* deprecate RootDirectoryStartOnly= in favour of a new ExecStart= prefix char * support projid-based quota in machinectl for containers * add a way to lock down cgroup migration: a boolean, which when set for a unit makes sure the processes in it can never migrate out of it * blog about fd store and restartable services * document Environment=SYSTEMD_LOG_LEVEL=debug drop-in in debugging document * rework ExecOutput and ExecInput enums so that EXEC_OUTPUT_NULL loses its magic meaning and is no longer upgraded to something else if set explicitly. * in the long run: permit a system with /etc/machine-id linked to /dev/null, to make it lose its identity, i.e. be anonymous. For this we'd have to patch through the whole tree to make all code deal with the case where no machine ID is available. * optionally, collect cgroup resource data, and store it in per-unit RRD files, suitable for processing with rrdtool. Add bus API to access this data, and possibly implement a CPULoad property based on it. * beef up pam_systemd to take unit file settings such as cgroups properties as parameters * maybe hook up xfs/ext4 quotactl() with services? i.e. automatically manage the quota of the user indicated in User= via unit file settings, like the other resource management concepts. Would mix nicely with DynamicUser=1. Or alternatively, do this with projids, so that we can also cover services running as root. Quota should probably cover all the special dirs such as StateDirectory=, LogsDirectory=, CacheDirectory=, as well as RootDirectory= if it is set, plus the whole disk space any image configured with RootImage=. * In DynamicUser= mode: before selecting a UID, use disk quota APIs on relevant disks to see if the UID is already in use. * Add AddUser= setting to unit files, similar to DynamicUser=1 which however creates a static, persistent user rather than a dynamic, transient user. We can leverage code from sysusers.d for this. 
* add some optional flag to ReadWritePaths= and friends, that has the effect that we create the dir in question when the service is started. Example: ReadWritePaths=:/var/lib/foobar * Add ExecMonitor= setting. May be used multiple times. Forks off a process in the service cgroup, which is supposed to monitor the service, and when it exits the service is considered failed by its monitor. * track the per-service PAM process properly (i.e. as an additional control process), so that it may be queried on the bus and everything. * add a new "debug" job mode, that is propagated to unit_start() and for services results in two things: we raise SIGSTOP right before invoking execve() and turn off watchdog support. Then, use that to implement "systemd-gdb" for attaching to the start-up of any system service in its natural habitat. * gpt-auto logic: support encrypted swap, add kernel cmdline option to force it, and honour a gpt bit about it, plus maybe a configuration file * add a percentage syntax for TimeoutStopSec=, e.g. TimeoutStopSec=150%, and then use that for the setting used in user@.service. It should be understood relative to the configured default value. 
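The percentage syntax for TimeoutStopSec= could be parsed roughly as follows. This is only a sketch: timeout_from_percent() is a hypothetical helper, usec_t is redefined locally, and the 90s default used in the usage note is merely the usual DefaultTimeoutStopSec= value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t usec_t;

/* Parses a string of the form "N%" and stores default_usec scaled by N percent in *ret.
 * Returns 0 on success, -EINVAL on bad syntax, -ERANGE on overflow. */
int timeout_from_percent(const char *s, usec_t default_usec, usec_t *ret) {
        assert(s);
        assert(ret);

        /* Require a leading digit: strtoull() would otherwise accept whitespace and signs. */
        if (*s < '0' || *s > '9')
                return -EINVAL;

        char *end;
        errno = 0;
        unsigned long long pct = strtoull(s, &end, 10);
        if (errno != 0)
                return -ERANGE;

        /* Exactly one '%' must follow the number, and nothing else. */
        if (end[0] != '%' || end[1] != '\0')
                return -EINVAL;

        if (pct != 0 && default_usec > UINT64_MAX / pct)
                return -ERANGE;

        *ret = default_usec * pct / 100;
        return 0;
}
```

With a default of 90s (90000000 usec), TimeoutStopSec=150% would resolve to 135s.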
* enable LockMLOCK to take a percentage value relative to physical memory * Permit masking specific netlink APIs with RestrictAddressFamilies= * define gpt header bits to select volatility mode * ProtectClock= (drops CAP_SYS_TIME, adds seccomp filters for settimeofday, adjtimex), sets DeviceAllow on /dev/rtc * ProtectTracing= (drops CAP_SYS_PTRACE, blocks ptrace syscall, makes /sys/kernel/tracing go away) * ProtectMount= (drop mount/umount/pivot_root from seccomp, disallow fuse via DeviceAllow, imply MountFlags=slave) * ProtectKeyRing= to take keyring calls away * RemoveKeyRing= to remove all keyring entries of the specified user * ProtectReboot= that masks reboot() and kexec_load() syscalls, prohibits kill on PID 1 with the relevant signals, and makes relevant files in /sys and /proc (such as the sysrq stuff) unavailable * Support ReadWritePaths/ReadOnlyPaths/InaccessiblePaths in systemd --user instances via the new unprivileged Landlock LSM (https://landlock.io) * make sure the ratelimit object can deal with USEC_INFINITY as way to turn off things * in nss-systemd, if we run inside of RootDirectory= with PrivateUsers= set, find a way to map the User=/Group= of the service to the right name. This way a user/group for a service only has to exist on the host for the right mapping to work. * add bus API for creating unit files in /etc, reusing the code for transient units * add bus API to remove unit files from /etc * add bus API to retrieve current unit file contents (i.e. implement "systemctl cat" on the bus only) * rework fopen_temporary() to make use of open_tmpfile_linkable() (problem: the kernel doesn't support linkat() that replaces existing files, currently) * transient units: don't bother with actually setting unit properties, we reload the unit file anyway * optionally, also require WATCHDOG=1 notifications during service start-up and shutdown * cache sd_event_now() result from before the first iteration...
* PID1: find a way how we can reload unit file configuration for specific units only, without reloading the whole of systemd * add an explicit parser for LimitRTPRIO= that verifies the specified range and generates sane error messages for incorrect specifications. * when we detect that there are waiting jobs but no running jobs, do something * PID 1 should send out sd_notify("WATCHDOG=1") messages (for usage in the --user mode, and when run via nspawn) * there's probably something wrong with having user mounts below /sys, as we have for debugfs. for example, src/core/mount.c handles mounts prefixed with /sys generally special. https://lists.freedesktop.org/archives/systemd-devel/2015-June/032962.html * fstab-generator: default to tmpfs-as-root if only usr= is specified on the kernel cmdline * docs: bring https://systemd.io/MY_SERVICE_CANT_GET_REALTIME up to date * add a job mode that will fail if a transaction would mean stopping running units. Use this in timedated to manage the NTP service state. https://lists.freedesktop.org/archives/systemd-devel/2015-April/030229.html * The udev blkid built-in should expose a property that reflects whether media was sensed in USB CF/SD card readers. This should then be used to control SYSTEMD_READY=1/0 so that USB card readers aren't picked up by systemd unless they contain a medium. This would mirror the behaviour we already have for CD drives. 
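For the LimitRTPRIO= item above, a sketch of what an explicit parser with range checks and readable error messages might look like. It assumes the valid range is 0..99 and accepts either a single value or "soft:hard"; special values such as "infinity" are left out. All names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define RTPRIO_MAX 99UL /* assumed upper bound for realtime priorities */

/* Parses one decimal value in 0..RTPRIO_MAX; *end is set to the first unparsed character. */
int parse_rtprio_value(const char *s, const char **end, unsigned long *ret) {
        if (*s < '0' || *s > '9') /* no sign, no whitespace, not empty */
                return -EINVAL;

        char *e;
        errno = 0;
        unsigned long v = strtoul(s, &e, 10);
        if (errno != 0 || v > RTPRIO_MAX)
                return -ERANGE;

        *end = e;
        *ret = v;
        return 0;
}

/* Parses "N" or "SOFT:HARD". On failure a human-readable explanation is written to err. */
int parse_limit_rtprio(const char *s, unsigned long *soft, unsigned long *hard,
                       char *err, size_t errsize) {
        assert(s && soft && hard && err);

        const char *p;
        int r = parse_rtprio_value(s, &p, soft);
        if (r < 0) {
                snprintf(err, errsize, "Invalid soft limit in '%s', expected 0..%lu.", s, RTPRIO_MAX);
                return r;
        }

        if (*p == '\0') { /* a single value sets both limits */
                *hard = *soft;
                return 0;
        }

        if (*p != ':') {
                snprintf(err, errsize, "Unexpected character '%c' in '%s'.", *p, s);
                return -EINVAL;
        }

        r = parse_rtprio_value(p + 1, &p, hard);
        if (r < 0 || *p != '\0') {
                snprintf(err, errsize, "Invalid hard limit in '%s', expected 0..%lu.", s, RTPRIO_MAX);
                return r < 0 ? r : -EINVAL;
        }

        if (*soft > *hard) {
                snprintf(err, errsize, "Soft limit %lu exceeds hard limit %lu.", *soft, *hard);
                return -EINVAL;
        }

        return 0;
}
```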
* hostnamectl: show root image uuid * Find a solution for SMACK capabilities stuff: https://lists.freedesktop.org/archives/systemd-devel/2014-December/026188.html * synchronize console access with BSD locks: https://lists.freedesktop.org/archives/systemd-devel/2014-October/024582.html * as soon as we have sender timestamps, revisit coalescing multiple parallel daemon reloads: https://lists.freedesktop.org/archives/systemd-devel/2014-December/025862.html * figure out when we can use the coarse timers * maybe allow timer units with an empty Units= setting, so that they can be used for resuming the system but nothing else. * what to do about udev db binary stability for apps? (raw access is not an option) * exponential backoff in timesyncd when we cannot reach a server * timesyncd: add ugly bus calls to set NTP servers per-interface, for usage by NM * add systemd.abort_on_kill or some other such flag to send SIGABRT instead of SIGKILL (throughout the codebase, not only PID1) * drop nss-myhostname in favour of nss-resolve? * resolved: - mDNS/DNS-SD - service registration - service/domain/types browsing - avahi compat - DNS-SD service registration from socket units - resolved should optionally register additional per-interface LLMNR names, so that for the container case we can establish the same name (maybe "host") for referencing the server, everywhere. - allow clients to request DNSSEC for a single lookup even if DNSSEC is off (?) - hook up resolved with machined-based address resolution * refcounting in sd-resolve is borked * add new gpt type for btrfs volumes * generator that automatically discovers btrfs subvolumes, identifies their purpose based on some xattr on them. * a way for container managers to turn off getty starting via $container_headless= or so... 
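The exponential backoff mentioned for timesyncd above could follow the usual doubling-with-a-cap pattern. A minimal sketch; backoff_next() and the bounds in the usage note are illustrative and not taken from timesyncd.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t usec_t;

/* Returns the next retry interval: start at min, double on every failure, never exceed max. */
usec_t backoff_next(usec_t current, usec_t min, usec_t max) {
        assert(min > 0);
        assert(min <= max);

        if (current < min)
                return min;

        /* Compare against max / 2 first so the doubling below cannot overflow. */
        if (current > max / 2)
                return max;

        return current * 2;
}
```

Starting from 0 with min=1 and max=60 (in whatever unit the caller uses), repeated failures give 1, 2, 4, 8, 16, 32, 60, 60, and so on; a successful contact would reset the interval to 0.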
* figure out a nice way how we can let the admin know what child/sibling unit causes cgroup membership for a specific unit * For timer units: add some mechanisms so that timer units that trigger immediately on boot do not have the services they run added to the initial transaction and thus confuse Type=idle. * add bus api to query unit file's X fields. * gpt-auto-generator: - Define new partition type for encrypted swap? Support probed LUKS for encrypted swap? - Make /home automount rather than mount? * add generator that pulls in systemd-network from containers when CAP_NET_ADMIN is set, more than the loopback device is defined, even when it is otherwise off * MessageQueueMessageSize= (and suchlike) should use parse_iec_size(). * implement Distribute= in socket units to allow running multiple service instances processing the listening socket, and open this up for ReusePort= * cgroups: - implement per-slice CPUFairScheduling=1 switch - introduce high-level settings for RT budget, swappiness - how to reset dynamically changed unit cgroup attributes sanely? - when reloading configuration, apply new cgroup configuration - when recursively showing the cgroup hierarchy, optionally also show the hierarchies of child processes - add settings for cgroup.max.descendants and cgroup.max.depth, maybe use them for user@.service * transient units: - add field to transient units that indicate whether systemd or somebody else saves/restores its settings, for integration with libvirt * libsystemd-journal, libsystemd-login, libudev: add calls to easily attach these objects to sd-event event loops * be more careful what we export on the bus as (usec_t) 0 and (usec_t) -1 * rfkill,backlight: we probably should run the load tools inside of the udev rules so that the state is properly initialized by the time other software sees it * If we try to find a unit via a dangling symlink, generate a clean error. Currently, we just ignore it and read the unit from the search path anyway. 
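For the MessageQueueMessageSize= item above: the point of an IEC-style parser is that suffixes multiply by powers of 1024. The following is a simplified stand-in written for illustration, not the real parse_iec_size(); it accepts only an integer followed by an optional B, K, M, G or T.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parses strings such as "512", "8K" or "1M" (base 1024) into a byte count.
 * Returns 0 on success, -EINVAL on bad syntax, -ERANGE on overflow. */
int parse_iec_size_sketch(const char *s, uint64_t *ret) {
        static const char suffixes[] = "BKMGT"; /* index = power of 1024 */

        assert(s);
        assert(ret);

        if (*s < '0' || *s > '9')
                return -EINVAL;

        char *end;
        errno = 0;
        unsigned long long v = strtoull(s, &end, 10);
        if (errno != 0)
                return -ERANGE;

        unsigned power = 0;
        if (*end != '\0') {
                const char *f = strchr(suffixes, *end);
                if (!f || end[1] != '\0')
                        return -EINVAL;
                power = (unsigned) (f - suffixes);
        }

        uint64_t result = v;
        for (unsigned i = 0; i < power; i++) {
                if (result > UINT64_MAX / 1024)
                        return -ERANGE;
                result *= 1024;
        }

        *ret = result;
        return 0;
}
```

MessageQueueMessageSize=8K would then come out as 8192 bytes.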
* refuse boot if /usr/lib/os-release is missing or /etc/machine-id cannot be set up * man: the documentation of Restart= currently is very misleading and suggests the tools from ExecStartPre= might get restarted. * There's currently no way to cancel fsck (used to be possible via C-c or c on the console) * add option to sockets to avoid activation. Instead just drop packets/connections, see http://cyberelk.net/tim/2012/02/15/portreserve-systemd-solution/ * make sure systemd-ask-password-wall does not shutdown systemd-ask-password-console too early * verify that the AF_UNIX sockets of a service in the fs still exist when we start a service in order to avoid confusion when a user assumes starting a service is enough to make it accessible * Make it possible to set the keymap independently from the font on the kernel cmdline. Right now setting one resets also the other. * add a dbus call to generate target from current state * investigate whether the gnome pty helper should be moved into systemd, to provide cgroup support. * dot output for --test showing the 'initial transaction' * be able to specify a forced restart of service A which service B depends on, in case B needs to be auto-respawned? * pid1: - When logging about multiple units (stopping BoundTo units, conflicts, etc.), log both units as UNIT=, so that journalctl -u triggers on both. - generate better errors when people try to set transient properties that are not supported... https://lists.freedesktop.org/archives/systemd-devel/2015-February/028076.html - recreate systemd's D-Bus private socket file on SIGUSR2 - when we automatically restart a service, ensure we restart its rdeps, too. - hide PAM options in fragment parser when compile time disabled - Support --test based on current system state - If we show an error about a unit (such as not showing up) and it has no Description string, then show a description string generated from the reverse of unit_name_mangle().
- after deserializing sockets in socket.c we should reapply sockopts and things - drop PID 1 reloading, only do reexecing (difficult: Reload() currently is properly synchronous, Reexec() is weird, because we cannot delay the response properly until we are back, so instead of being properly synchronous we just keep open the fd and close it when done. That means clients do not get a successful method reply, but much rather a disconnect on success.) - when breaking cycles drop sysv services first, then services from /run, then from /etc, then from /usr - when a bus name of a service disappears from the bus make sure to queue further activation requests - maybe introduce CoreScheduling=yes/no to optionally set a PR_SCHED_CORE cookie, so that all processes in a service's cgroup share the same cookie and are guaranteed not to share SMT cores with other units https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/hw-vuln/core-scheduling.rst - ExtensionImages= deduplication for services is currently only applied to disk images without GPT envelope. This should be extended to work with proper DDIs too, as well as directory confext/sysext. Moreover, system-wide confext/sysext should support this too. - Pin the mount namespace via FD by sending it back from sd-exec to the manager, and use it for live mounting, instead of doing it via PID * unit files: - allow port=0 in .socket units - maybe introduce ExecRestartPre= - implement Register= switch in .socket units to enable registration in Avahi, RPC and other socket registration services.
- allow Type=simple with PIDFile= https://bugzilla.redhat.com/show_bug.cgi?id=723942 - allow writing multiple conditions in unit files on one line - introduce Type=pid-file - add a concept of RemainAfterExit= to scope units - Allow multiple ExecStart= for all Type= settings, so that we can cover rescue.service nicely - add verification of [Install] section to systemd-analyze verify * timer units: - timer units should get the ability to trigger when DST changes - Modulate timer frequency based on battery state * add libsystemd-password or so to query passwords during boot using the password agent logic * clean up date formatting and parsing so that all absolute/relative timestamps we format can also be parsed * on shutdown: move utmp, wall, audit logic all into PID 1 (or logind?), get rid of systemd-update-utmp-runlevel * make repeated alt-ctrl-del presses print a dump * currently x-systemd.timeout is lost in the initrd, since crypttab is copied into dracut, but fstab is not * add a pam module that on password changes updates any LUKS slot where the password matches * test/: - add unit tests for config_parse_device_allow() * seems that when we follow symlinks to units we prefer the symlink destination path over /etc and /usr. We should not do that. Instead /etc should always override /run+/usr and also any symlink destination. * when isolating, try to figure out a way how we implicitly can order all units we stop before the isolating unit... * teach ConditionKernelCommandLine= globs or regexes (in order to match foobar={no,0,off}) * Make ConditionDirectoryNotEmpty= handle non-absolute paths as a search path, or add ConditionConfigSearchPathNotEmpty= or different syntax? See the discussion starting at https://github.com/systemd/systemd/pull/15109#issuecomment-607740136. * BootLoaderSpec: Define a way how an installer can figure out whether a BLS compliant boot loader is installed. * think about requeuing jobs when daemon-reload is issued?
use case: the initrd issues a reload after fstab from the host is accessible and we might want to requeue the mounts local-fs acquired through that automatically. * systemd-inhibit: make taking delay locks useful: support sending SIGINT or SIGTERM on PrepareForSleep() * remove any syslog support from log.c — we probably cannot do this before split-off udev is gone for good * shutdown logging: store to EFI var, and store to USB stick? * merge unit_kill_common() and unit_kill_context() * add a dependency on standard-conf.xml and other included files to man pages * MountFlags=shared acts as MountFlags=slave right now. * properly handle loop back mounts via fstab, especially regards to fsck/passno * initialize the hostname from the fs label of /, if /etc/hostname does not exist? * sd-bus: - EBADSLT handling - GetAllProperties() on a non-existing object does not result in a failure currently - port to sd-resolve for connecting to TCP dbus servers - see if we can introduce a new sd_bus_get_owner_machine_id() call to retrieve the machine ID of the machine of the bus itself - see if we can drop more message validation on the sending side - add API to clone sd_bus_message objects - longer term: priority inheritance - dbus spec updates: - NameLost/NameAcquired obsolete - path escaping - update systemd.special(7) to mention that dbus.socket is only about the compatibility socket now * sd-event - allow multiple signal handlers per signal? - document chaining of signal handler for SIGCHLD and child handlers - define more intervals where we will shift wakeup intervals around in, 1h, 6h, 24h, ... - maybe support iouring as backend, so that we allow hooking read and write operations instead of IO ready events into event loops. See considerations here: http://blog.vmsplice.net/2020/07/rethinking-event-loop-integration-for.html * dbus: when a unit failed to load (i.e. is in UNIT_ERROR state), we should be able to safely try another attempt when the bus call LoadUnit() is invoked. 
* document org.freedesktop.MemoryAllocation1 * maybe do not install getty@tty1.service symlink in /etc but in /usr? * print a nicer explanation if people use variable/specifier expansion in ExecStart= for the first word * mount: turn dependency information from /proc/self/mountinfo into dependency information between systemd units. * EFI: - honor language efi variables for default language selection (if there are any?) - honor timezone efi variables for default timezone selection (if there are any?) * bootctl - recognize the case when not booted on EFI * bootctl: - show whether UEFI audit mode is available - teach it to prepare an ESP wholesale, i.e. with mkfs.vfat invocation - teach it to copy in unified kernel images and maybe type #1 boot loader spec entries from host * logind: - logind: optionally, ignore idle-hint logic for autosuspend, block suspend as long as a session is around - logind: wakelock/opportunistic suspend support - Add pretty name for seats in logind - logind: allow showing logout dialog from system? - add Suspend() bus calls which take timestamps to fix double suspend issues when somebody hits suspend and closes laptop quickly. - if pam_systemd is invoked by su from a process that is outside of any session we should probably just become a NOP, since that's usually not a real user session but just some system code that just needs setuid(). - logind: make the Suspend()/Hibernate() bus calls wait for the job to be completed before returning, so that clients can wait for "systemctl suspend" to finish to know when the suspending is complete. - logind: when the power button is pressed short, just pop up a logout dialog. If it is pressed for 1s, do the usual shutdown. Inspiration are Macs here. - expose "Locked" property on logind session objects - maybe allow configuration of the StopTimeout for session scopes - rename session scope so that it includes the UID.
That way the session scope can be arranged freely in slices and we don't have to make assumptions about their slice anymore. - follow PropertiesChanged state more closely, to deal with quick logouts and relogins - (optionally?) spawn seat-manager@$SEAT.service whenever a seat shows up that has CanGraphical set * move multiseat vid/pid matches from logind udev rule to hwdb * delay activation of logind until somebody logs in, or when /dev/tty0 pulls it in or lingering is on (so that containers don't bother with it until PAM is used). also exit-on-idle * journal: - consider introducing implicit _TTY= + _PPID= + _EUID= + _EGID= + _FSUID= + _FSGID= fields - journald: also get thread ID from client, plus thread name - journal: when waiting for journal additions in the client always sleep at least 1s or so, in order to minimize wakeups - add API to close/reopen/get fd for journal client fd in libsystemd-journal. - fall back to /dev/log based logging in libsystemd-journal, if we cannot log natively? - declare the local journal protocol stable in the wiki interface chart - sd-journal: speed up sd_journal_get_data() with transparent hash table in bg - journald: when dropping msgs due to ratelimit make sure to write "dropped %u messages" not only when we are about to print the next message that works, but already after a short timeout - check if we can make journalctl by default use --follow mode inside of less if called without args? - maybe add API to send pairs of iovecs via sd_journal_send - journal: add a setgid "systemd-journal" utility to invoke from libsystemd-journal, which passes fds via STDOUT and does PK access - journalctl: support negative filtering, i.e. FOOBAR!="waldo", and !FOOBAR for events without FOOBAR. - journal: store timestamp of journal_file_set_offline() in the header, so it is possible to display when the file was last synced.
- journal-send.c, log.c: when the log socket is clogged, and we drop, count this and write a message about this when it gets unclogged again. - journal: find a way to allow dropping history early, based on priority, other rules - journal: When used on NFS, check payload hashes - journald: add kernel cmdline option to disable ratelimiting for debug purposes - refuse taking lower-case variable names in sd_journal_send() and friends. - journald: we currently rotate only after MaxUse+MaxFilesize has been reached. - journal: deal nicely with byte-by-byte copied files, especially regards header - journal: sanely deal with entries which are larger than the individual file size, but where the components would fit - Replace utmp, wtmp, btmp, and lastlog completely with journal - journalctl: instead of --after-cursor= maybe have a --cursor=XYZ+1 syntax? - when a kernel driver logs in a tight loop, we should ratelimit that too. - journald: optionally, log debug messages to /run but everything else to /var - journald: when we drop syslog messages because the syslog socket is full, make sure to write how many messages are lost as first thing to syslog when it works again. - journald: allow per-priority and per-service retention times when rotating/vacuuming - journald: make use of uid-range.h to manage uid ranges to split journals in. - journalctl: add the ability to look for the most recent process of a binary. journalctl /usr/bin/X11 --invocation=-1 - systemctl: change 'status' to show logs for the last invocation, not a fixed number of lines - systemctl: expand --wait to show logs for the invocation with a new switch - improve journalctl performance by loading journal files lazily. Encode just enough information in the file name, so that we do not have to open it to know that it is not interesting for us, for the most common operations.
- man: document that corrupted journal files are nothing to act on - rework journald sigbus stuff to use mutex - Set RLIMIT_NPROC for systemd-journal-xyz, and all other of our services that run under their own user ids, and use User= (but only in a world where userns is ubiquitous since otherwise we cannot invoke those daemons on the host AND in a container anymore). Also, if LimitNPROC= is used without User= we should warn and refuse operation. - journalctl --verify: don't show files that are currently being written to as FAIL, but instead show that they are being written to. - add journalctl -H that talks via ssh to a remote peer and passes through binary logs data - add a version of --merge which also merges /var/log/journal/remote - journalctl: -m should access container journals directly by enumerating them via machined, and also watch containers coming and going. Benefit: nspawn --ephemeral would start working nicely with the journal. - assign MESSAGE_ID to log messages about failed services - check if loop in decompress_blob_xz() is necessary * journald: support RFC3164 fully for the incoming syslog transport, see https://github.com/systemd/systemd/issues/19251#issuecomment-816601955 * Hook up journald's FSS logic with TPM2: seal the verification disk by time-based policy, so that the verification key can remain on host and be validated via TPM. * rework journalctl -M to be based on a machined method that generates a mount fd of the relevant journal dirs in the container with uidmapping applied to allow the host to read it, while making everything read-only. * journald: add varlink service that allows subscribing to certain log events, for example matching by message ID, or log level returns a list of journal cursors as they happen. * journald: also collect CLOCK_BOOTTIME timestamps per log entry.
Then, derive "corrected" CLOCK_REALTIME information on display from that and the timestamp info of the newest entry of the specific boot (as identified by the boot ID). This way, if a system comes up without a valid clock but acquires a better clock later, we can "fix" older entry timestamps on display, by calculating backwards. We cannot use CLOCK_MONOTONIC for this, since it does not account for suspend phases. This would then also enable us to correct the kmsg timestamping we consume (where we erroneously assume the clock was in CLOCK_MONOTONIC, but it actually is CLOCK_BOOTTIME as per kernel). * in journald, write out a recognizable log record whenever the system clock is changed ("stepped"), and in timesyncd whenever we acquire an NTP fix ("slewing"). Then, in journalctl for each boot time we come across, find these records, and use the structured info they include to display "corrected" wallclock time, as calculated from the monotonic timestamp in the log record, adjusted by the delta declared in the structured log record. * in journald: whenever we start a new journal file because the boot ID changed, let's generate a recognizable log record containing info about old and new ID. Then, when displaying log stream in journalctl look for these records, to be able to order them. * journald: generate recognizable log events whenever we shutdown journald cleanly, and when we migrate run → var. This way tools can verify that a previous boot terminated cleanly, because either of these two messages must be safely written to disk, then. * hook up journald with TPMs? measure new journal records to the TPM in regular intervals, validate the journal against current TPM state with that. (taking inspiration from IMA log) * sd-journal puts a limit on parallel journal files to view at once. journald should probably honour that same limit (JOURNAL_FILES_MAX) when vacuuming to ensure we never generate more files than we can actually view. * bsod: maybe use graphical mode. 
Use DRM APIs directly, see https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c for an example for doing that. * maybe implicitly attach monotonic+realtime timestamps to outgoing messages in log.c and sd-journal-send * journalctl/timesyncd: whenever timesyncd acquires a synchronization from NTP, create a structured log entry that contains boot ID, monotonic clock and realtime clock (I mean, this requires no special work, as these three fields are implicit). Then in journalctl when attempting to display the realtime timestamp of a log entry, first search for the closest later log entry of this kind that has a matching boot id, and convert the monotonic clock timestamp of the entry to the realtime clock using this info. This way we can retroactively correct the wallclock timestamps, in particular for systems without RTC, i.e. where initially wallclock timestamps carry rubbish, until an NTP sync is acquired. * introduce per-unit (i.e. per-slice, per-service) journal log size limits. * journald: do journal file writing out-of-process, with one writer process per client UID, so that synthetic hash table collisions can slow down a specific user's journal stream but not the others. * tweak journald context caching. In addition to caching per-process attributes keyed by PID, cache per-cgroup attributes (i.e. the various xattrs we read) keyed by cgroup path, and guarded by ctime changes. This should provide us with a nice speed-up on services that have many processes running in the same cgroup. * maybe add call sd_journal_set_block_timeout() or so to set SO_SNDTIMEO for the sd-journal logging socket, and, if the timeout is set to 0, sets O_NONBLOCK on it. That way people can control if and when to block for logging.
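The journalctl/timesyncd item above (retroactively correcting wallclock timestamps from a later NTP sync record of the same boot) boils down to a small calculation. The following is a sketch under stated assumptions: the Entry type and its is_sync marker are hypothetical stand-ins for journal entries and for whatever structured record timesyncd would emit, and all clocks are plain integers in microseconds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Entry:
    boot_id: str
    monotonic_usec: int
    realtime_usec: int
    is_sync: bool = False  # hypothetical marker for an "NTP sync acquired" record


def corrected_realtime(entry: Entry, entries: Iterable[Entry]) -> int:
    """Return a corrected wallclock time for `entry`, in microseconds.

    Picks the closest later sync record with the same boot ID and walks
    backwards from its realtime value by the monotonic distance.
    Falls back to the recorded realtime value if no such record exists.
    """
    later_syncs = [
        e for e in entries
        if e.is_sync
        and e.boot_id == entry.boot_id
        and e.monotonic_usec >= entry.monotonic_usec
    ]
    if not later_syncs:
        return entry.realtime_usec
    sync = min(later_syncs, key=lambda e: e.monotonic_usec)
    return sync.realtime_usec - (sync.monotonic_usec - entry.monotonic_usec)
```

The sketch deliberately ignores suspend phases; as noted in the CLOCK_BOOTTIME item, a monotonic clock that stops during suspend would make the backwards calculation wrong across a suspend cycle.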
* journalctl: make sure -f ends when the container indicated by -M terminates * journald: sigbus API via a signal-handler safe function that people may call from the SIGBUS handler * add a test if all entries in the catalog are properly formatted. (Adding dashes in a catalog entry currently results in the catalog entry being silently skipped. journalctl --update-catalog must warn about this, and we should also have a unit test to check that all our messages are OK.) * build short web pages out of each catalog entry, build them along with man pages, and include hyperlinks to them in the journal output * homed: - when user tries to log into record signed by unrecognized key, automatically add key to our chain after polkit auth - rollback when resize fails mid-operation - GNOME's side for forget key on suspend (requires rework so that lock screen runs outside of uid) - update LUKS password on login if we find there's a password that unlocks the JSON record but not the LUKS device. - create on activate? - properties: icon url?, administrator bool (which translates to 'wheel' membership)?, address?, telephone?, vcard?, samba stuff?, parental controls? - communicate clearly when usb stick is safe to remove. probably involves beefing up logind to make pam session close hook synchronous and wait until systemd --user is shut down. - logind: maybe keep a "busy fd" as long as there's a non-released session around or the user@.service - maybe make automatic, read-only, time-based reflink-copies of LUKS disk images (and btrfs snapshots of subvolumes) (think: time machine) - distinguish destroy / remove (i.e.
currently we can unregister a user, unregister+remove their home directory, but not just remove their home directory) - in systemd's PAMName= logic: query passwords with ssh-askpassword, so that we can make "loginctl set-linger" mode work - fingerprint authentication, pattern authentication, … - make sure "classic" user records can also be managed by homed - make size of $XDG_RUNTIME_DIR configurable in user record - move acct mgmt stuff from pam_systemd_home to pam_systemd? - when "homectl --pkcs11-token-uri=" is used, synthesize ssh-authorized-keys records for all keys we have private keys on the stick for - make slice for users configurable (requires logind rework) - logind: populate auto-login list bus property from PKCS#11 token - when determining state of a LUKS home directory, check DM suspended sysfs file - when homed is in use, maybe start the user session manager in a mount namespace with MS_SLAVE, so that mounts propagate down but not up - eg, user A setting up a backup volume doesn't mean user B sees it - use credentials logic/TPM2 logic to store homed signing key - permit multiple user record signing keys to be used locally, and pick the right one for signing records automatically depending on a pre-existing signature - add a way to "adopt" a home directory, i.e. strip foreign signatures and insert a local signature instead. - as an extension to the directory+subvolume backend: if located on especially marked fs, then sync down password into LUKS header of that fs, and always verify passwords against it too. Bootstrapping is a problem though: if no one is logged in (or no other user even exists yet), how do you unlock the volume in order to create the first user and add the first pw. - support new FS_IOC_ADD_ENCRYPTION_KEY ioctl for setting up fscrypt - maybe pre-create ~/.cache as subvol so that it can have separate quota easily? 
- store PKCS#11 + FIDO2 token info in LUKS2 header, compatible with systemd-cryptsetup, so that it can unlock homed volumes - maybe make all *.home files owned by `systemd-home` user or so, so that we can easily set overall quota for all users - on login, if we can't fallocate initially, but rebalance is on, then allow login in discard mode, then immediately rebalance, then turn off discard - add "homectl unbind" command to remove local user record of an inactive home dir * add a new switch --auto-definitions=yes/no or so to systemd-repart. If specified, synthesize a definition automatically if we can: enlarge last partition on disk, but only if it is marked for growing and not read-only. * systemd-repart: read LUKS encryption key from $CREDENTIALS_DIRECTORY * systemd-repart: add a switch to factory reset the partition table without immediately applying the new configuration again. i.e. --factory-reset=leave or so. (this is useful to factory reset an image, then putting it into another machine, ensuring that luks key is generated on new machine, not old) * systemd-repart: support setting up dm-integrity with HMAC * systemd-repart: maybe remove half-initialized image on failure. It fails if the output file exists, so a repeated invocation will usually fail if something goes wrong on the way. * systemd-repart: by default generate minimized partition tables (i.e. tables that only cover the space actually used, excluding any free space at the end), in order to maximize dd'ability. Requires libfdisk work, see https://github.com/karelzak/util-linux/issues/907 * systemd-repart: MBR partition table support. Care needs to be taken regarding Type=, so that partition definitions can sanely apply to both the GPT and the MBR case. Idea: accept syntax "Type=gpt:home mbr:0x83" for setting the types for the two partition types explicitly. And provide an internal mapping so that "Type=linux-generic" maps to the right types for both partition tables automatically. 
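The Type= idea in the MBR item above ("Type=gpt:home mbr:0x83", plus an internal mapping for names such as "linux-generic") could be parsed roughly as follows. This is a sketch of a proposed syntax that systemd-repart does not implement; parse_type and the GENERIC_TYPES table are hypothetical, and the table carries a single entry purely for illustration.

```python
# Hypothetical built-in mapping for type names that are valid for both tables.
GENERIC_TYPES = {
    "linux-generic": {
        "gpt": "0fc63daf-8483-4772-8e79-3d69d8477de4",
        "mbr": "0x83",
    },
}
TABLES = ("gpt", "mbr")


def parse_type(value):
    """Parse a Type= value into a {"gpt": ..., "mbr": ...} style dict.

    Accepts either one generic name ("linux-generic") or a whitespace
    separated list of "<table>:<type>" pairs ("gpt:home mbr:0x83").
    """
    tokens = value.split()
    if not tokens:
        raise ValueError("empty Type= value")
    if len(tokens) == 1 and ":" not in tokens[0]:
        if tokens[0] not in GENERIC_TYPES:
            raise ValueError(f"unknown generic type: {tokens[0]!r}")
        return dict(GENERIC_TYPES[tokens[0]])
    result = {}
    for token in tokens:
        table, sep, type_name = token.partition(":")
        if not sep or table not in TABLES or not type_name:
            raise ValueError(f"expected gpt:<type> or mbr:<type>, got {token!r}")
        if table in result:
            raise ValueError(f"type for {table} specified twice")
        result[table] = type_name
    return result
```

Explicit pairs are returned unresolved (the "home" in "gpt:home" stays a name), so resolving names to type UUIDs remains a separate step.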
* systemd-repart: allow sizing partitions as factor of available RAM, so that we can reasonably size swap partitions for hibernation. * systemd-repart: allow boolean option that ensures that if existing partition doesn't exist within the configured size bounds the whole command fails. This is useful to implement ESP vs. XBOOTLDR schemes in installers: have one set of repart files for the case where ESP is large enough and one where it isn't and XBOOTLDR is added in instead. Then apply the former first, and if it fails to apply use the latter. * systemd-repart: add per-partition option to never reuse existing partition and always create anew even if matching partition already exists. * systemd-repart: add per-partition option to fail if partition already exists, i.e. is not added new. Similar, add option to fail if partition does not exist yet. * systemd-repart: allow disabling growing of specific partitions, or making them (think ESP: we don't ever want to grow it, since we cannot resize vfat) Also add option to disable operation via kernel command line. * systemd-repart: make it a static checker during early boot for existence and absence of other partitions for trusted boot environments * systemd-repart: add support for SD_GPT_FLAG_GROWFS also on real systems, i.e. generate some unit to actually enlarge the fs after growing the partition during boot. * systemd-repart: do not print "Successfully resized …" when no change was done. * document: - document that deps in [Unit] sections ignore Alias= fields in [Install] units of other units, unless those units are disabled - man: clarify that time-sync.target is not only sysv compat but also useful otherwise. Same for similar targets - document that service reload may be implemented as service reexec - add a man page containing packaging guidelines and recommending usage of things like Documentation=, PrivateTmp=, PrivateNetwork= and ReadOnlyDirectories=/etc /usr.
- document systemd-journal-flush.service properly - documentation: recommend to connect the timer units of a service to the service via Also= in [Install] - man: document the very specific env the shutdown drop-in tools live in - man: add more examples to man pages, - in particular an example how to do the equivalent of switching runlevels - man: maybe sort directives in man pages, and take sections from --help and apply them to man too - document root=gpt-auto properly * systemctl: - add systemctl switch to dump transaction without executing it - Add a verbose mode to "systemctl start" and friends that explains what is being done or not done - print nice message from systemctl --failed if there are no entries shown, and hook that into ExecStartPre of rescue.service/emergency.service - add new command to systemctl: "systemctl system-reexec" which reexecs as many daemons as virtually possible - systemctl enable: fail if target to alias into does not exist? maybe show how many units are enabled afterwards? - systemctl: "Journal has been rotated since unit was started." message is misleading * introduce an option (or replacement) for "systemctl show" that outputs all properties as JSON, similar to busctl's new JSON output. In contrast to that it should skip the variant type string though. * Add a "systemctl list-units --by-slice" mode or so, which rearranges the output of "systemctl list-units" slightly by showing the tree structure of the slices, and the units attached to them. * add "systemctl wait" or so, which does what "systemd-run --wait" does, but for all units. It should be both a way to pin units into memory as well as a way to retrieve their exit data. * show whether a service has out-of-date configuration in "systemctl status" by using mtime data of ConfigurationDirectory=.
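The last item above (flagging out-of-date configuration from the mtime data of ConfigurationDirectory=) amounts to comparing the newest modification time below the directory with the time the service was started. A minimal sketch, assuming the caller already knows the start time in nanoseconds since the epoch; how "systemctl status" would obtain it is left open, and both function names are hypothetical.

```python
import os


def newest_mtime_ns(directory):
    """Return the newest mtime (in ns) of `directory` and everything below it."""
    newest = os.stat(directory).st_mtime_ns
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            # lstat(): a symlink counts with its own mtime, and a dangling
            # symlink does not raise.
            path = os.path.join(root, name)
            newest = max(newest, os.lstat(path).st_mtime_ns)
    return newest


def configuration_outdated(config_dir, started_ns):
    """True if anything under `config_dir` changed after the service started."""
    return newest_mtime_ns(config_dir) > started_ns
```

Directory mtimes are included so that deleted files are noticed as well, since removing a file updates the mtime of its parent directory.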
* "systemctl preset-all" should probably order the unit files it operates on lexicographically before starting to work, in order to ensure deterministic behaviour if two unit files conflict (like DMs do, for example) * add "systemctl start -v foobar.service" that shows logs of a service while the start command runs. This is non-trivial to do without races though, since we should flush out all journal messages before returning from the "systemctl stop". * systemctl: if some operation fails, show log output? * Add a new verb "systemctl top" * unit install: - "systemctl mask" should find all names by which a unit is accessible (i.e. by scanning for symlinks to it) and link them all to /dev/null * nspawn: - emulate /dev/kmsg using CUSE and turn off the syslog syscall with seccomp. That should provide us with a useful log buffer that systemd can log to during early boot, and disconnect container logs from the kernel's logs. - as soon as networkd has a bus interface, hook up --network-interface=, --network-bridge= with networkd, to trigger netdev creation should an interface be missing - a nice way to boot up without machine id set, so that it is set at boot automatically for supporting --ephemeral. Maybe hash the host machine id together with the machine name to generate the machine id for the container - fix logic to always print a final newline on output. https://github.com/systemd/systemd/pull/272#issuecomment-113153176 - should optionally support receiving WATCHDOG=1 messages from its payload PID 1... - optionally automatically add FORWARD rules to iptables whenever nspawn is running, remove them when shut down. - add support for sysext extensions, too. i.e. a new --extension= switch that takes one or more arguments, and applies the extensions already during startup. - when main nspawn supervisor process gets suspended due to SIGSTOP/SIGTTOU or so, freeze the payload too.
- support time namespaces - on cgroupsv1 issue cgroup empty handler process based on host events, so that we make cgroup agent logic safe - add API to invoke binary in container, then use that as fallback in "machinectl shell" - make nspawn suitable for shell pipelines: instead of triggering a hangup when input is finished, send ^D, which synthesizes an EOF. Then wait for hangup or ^D before passing on the EOF. - greater control over selinux label? - support that /proc, /sys/, /dev are pre-mounted - maybe allow TPM passthrough, backed by swtpm, and measure --image= hash into its PCR 11, so that nspawn instances can be TPM enabled, and partake in measurements/remote attestation and such. swtpm would run outside of control of container, and ideally would itself bind its encryption keys to host TPM. - make boot assessment do something sensible in a container. i.e. send an sd_notify() from payload to container manager once boot-up is completed successfully, and use that in nspawn for dealing with boot counting, implemented in the partition table labels and directory names. - optionally set up nftables/iptables routes that forward UDP/TCP traffic on port 53 to resolved stub 127.0.0.54 - maybe optionally insert .nspawn file as GPT partition into images, so that such container images are entirely stand-alone and can be updated as one. - The subreaper logic we currently have seems overly complex. We should investigate whether creating the inner child with CLONE_PARENT isn't better. - Reduce the number of sockets that are currently in use and just rely on one or two sockets. - Support running nspawn as an unprivileged user.
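One of the nspawn ideas above is to derive the container's machine ID by hashing the host machine ID together with the machine name. A minimal sketch of such a derivation follows; the choice of HMAC-SHA256 keyed by the host ID and the function name derive_machine_id are assumptions for illustration, not a description of what nspawn does.

```python
import hashlib
import hmac


def derive_machine_id(host_machine_id: str, machine_name: str) -> str:
    """Derive a stable 128-bit machine ID (32 hex characters) for a container.

    The host machine ID is used as the HMAC key, so the result is stable
    for a given host and name but does not reveal the host ID itself.
    """
    key = bytes.fromhex(host_machine_id)
    if len(key) != 16:
        raise ValueError("machine ID must be 128 bits (32 hex characters)")
    digest = hmac.new(key, machine_name.encode("utf-8"), hashlib.sha256).digest()
    # Truncate the 256-bit digest to the 128 bits a machine ID holds.
    return digest[:16].hex()
```

Because the result depends only on the host ID and the machine name, an --ephemeral container started twice under the same name would see the same ID both times.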
* machined: - add an API so that libvirt-lxc can inform us about network interfaces being removed or added to an existing machine - "machinectl migrate" or similar to copy a container from or to a different host, via ssh - introduce systemd-nspawn-ephemeral@.service, and hook it into "machinectl start" with a new --ephemeral switch - "machinectl status" should also show internal logs of the container in question - "machinectl history" - "machinectl diff" - "machinectl commit" that takes a writable snapshot of a tree, invokes a shell in it, and marks it read-only after use * udev: - move to LGPL - kill scsi_id - add trigger --subsystem-match=usb/usb_device device - reimport udev db after MOVE events for devices without dev_t - re-enable ProtectClock= once only cgroupsv2 is supported. See f562abe2963bad241d34e0b308e48cf114672c84. * coredump: - save coredump in Windows/Mozilla minidump format - when truncating coredumps, also log the full size that the process had, and make a metadata field so we can report truncated coredumps - add examples for other distros in ELF_PACKAGE_METADATA * support crash reporting operation modes (https://live.gnome.org/GnomeOS/Design/Whiteboards/ProblemReporting) * tmpfiles: - allow time-based cleanup in r and R too - instead of ignoring unknown fields, reject them. - creating new directories/subvolumes/fifos/device nodes should not follow symlinks. None of the other adjustment or creation calls follow symlinks. - teach tmpfiles.d q/Q logic something sensible in the context of XFS/ext4 project quota - teach tmpfiles.d m/M to move / atomic move + symlink old -> new - add new line type for setting btrfs subvolume attributes (i.e.
rw/ro) - tmpfiles: add new line type for setting fcaps - add -n as shortcut for --dry-run in tmpfiles & sysusers & possibly other places * udev-link-config: - Make sure ID_PATH is always exported and complete for network devices where possible, so we can safely rely on Path= matching * sd-rtnl: - add support for more attribute types - inbuilt piping support (essentially degenerate async)? see loopback-setup.c and other places * networkd: - add more keys to [Route] and [Address] sections - add support for more DHCPv4 options (and, longer term, other kinds of dynamic config) - add reduced [Link] support to .network files - properly handle routerless dhcp leases - work with non-Ethernet devices - dhcp: do we allow configuring dhcp routes on interfaces that are not the one we got the dhcp info from? - the DHCP lease data (such as NTP/DNS) is still made available when a carrier is lost on a link. It should be removed instantly. - expose in the API the following bits: - option 15, domain name - option 12, hostname and/or option 81, fqdn - option 123, 144, geolocation - option 252, configure http proxy (PAC/wpad) - provide a way to define a per-network interface default metric value for all routes to it. possibly a second default for DHCP routes. - allow Name= to be specified repeatedly in the [Match] section. Maybe also support Name=foo*|bar*|baz ? - whenever uplink info changes, make DHCP server send out FORCERENEW * in networkd, when matching device types, fix up DEVTYPE rubbish the kernel passes to us * Figure out how to do unittests of networkd's state serialization * dhcp: - figure out how much we can increase Maximum Message Size * dhcp6: - add functions to set previously stored IPv6 addresses on startup and get them at shutdown; store them in client->ia_na - write more test cases - implement reconfigure support, see 5.3., 15.11. and 22.20. 
- implement support for temporary addresses (IA_TA) - implement dhcpv6 authentication - investigate the usefulness of Confirm messages; i.e. are there any situations where the link changes without any loss in carrier detection or interface down - some servers don't do rapid commit without a filled in IA_NA, verify this behavior - RouteTable= ? * shared/wall: Once more programs are taught to prefer sd-login over utmp, switch the default wall implementation to wall_logind (https://github.com/systemd/systemd/pull/29051#issuecomment-1704917074)