IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Fedora 28 is out already, let's advertise it. While at it, drop "container"
from "f28container" — it's a subdirectory under /var/lib/machines, it's pretty
obvious that's it a container.
To make the switch easier in the future, define the number as an entity.
Similar as the other options added before, this is primarily useful to
provide comprehensive OCI runtime compatbility, but might be useful
otherwise, too.
This simply controls the PR_SET_NO_NEW_PRIVS flag for the container.
This too is primarily relevant to provide OCI runtime compaitiblity, but
might have other uses too, in particular as it nicely complements the
existing --capability= and --drop-capability= flags.
Previously, the container's hostname was exclusively initialized from
the machine name configured with --machine=, i.e. the internal name and
the external name used for and by the container was synchronized. This
adds a new option --hostname= that optionally allows the internal name
to deviate from the external name.
This new option is mainly useful to ultimately implement the OCI runtime
spec directly in nspawn, but it might be useful on its own for some
other usecases too.
This ensures we set the various resource limits of our container
explicitly on each invocation so that we inherit less from our callers
into the payload.
By default resource limits are now set to the same values Linux
generally passes to the host PID 1, thus minimizing needless differences
between host and container environments.
The limits are now also configurable using a new --rlimit= switch. This
is preparation for teaching nspawn native OCI runtime support as OCI
permits setting resource limits for container payloads, and it hence
probably makes sense if we do too.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
Nowadays people use systemd on many different architectures, so we
shouldn't presuppose that they are using amd64. debootstrap defaults
to the native architecture and this should be good enough.
Systemd services are permitted to be scripts, as well as binary
executables.
The same also applies to the underlying /sbin/mount and /sbin/swapon.
It is not necessary for the user to consider what type of program file
these are. Nor is it necessary with systemd-nspawn, to distinguish between
init as a "binary" v.s. a user-specified "program".
Also fix a couple of grammar nits in the modified sentences.
Add a new option `--network-namespace-path` to systemd-nspawn to allow
users to specify an arbitrary network namespace, e.g. `/run/netns/foo`.
Then systemd-nspawn will open the netns file, pass the fd to
outer_child, and enter the namespace represented by the fd before
running inner_child.
```
$ sudo ip netns add foo
$ mount | grep /run/netns/foo
nsfs on /run/netns/foo type nsfs (rw)
...
$ sudo systemd-nspawn -D /srv/fc27 --network-namespace-path=/run/netns/foo \
/bin/readlink -f /proc/self/ns/net
/proc/1/ns/net:[4026532009]
```
Note that the option `--network-namespace-path=` cannot be used together
with other network-related options such as `--private-network` so that
the options do not conflict with each other.
Fixes https://github.com/systemd/systemd/issues/7361
Let's lock things down a bit, and maintain a list of what's permitted
rather than a list of what's prohibited in nspawn (also to make things a
bit more like Docker and friends).
Note that this slightly alters the effect of --system-call-filter=, as
now the negative list now takes precedence over the positive list.
However, given that the option is just a few days old and not included
in any released version it should be fine to change it at this point in
time.
Note that the whitelist is good chunk more restrictive thatn the
previous blacklist. Specifically:
- fanotify is not permitted (given the buffer size issues it's
problematic in containers)
- nfsservctl is not permitted (NFS server support is not virtualized)
- pkey_xyz stuff is not permitted (really new stuff I don't grok)
- @cpu-emulation is prohibited (untested legacy stuff mostly, and if
people really want to run dosemu in nspawn, they should use
--system-call-filter=@cpu-emulation and all should be good)
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.
This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.
(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)
Fixes: #5163
Replaces: #5944
Previously, only when --register=yes was set (the default) the invoked
container would get its own scope, created by machined on behalf of
nspawn. With this change if --register=no is set nspawn will still get
its own scope (which is a good thing, so that --slice= and --property=
take effect), but this is not done through machined but by registering a
scope unit directly in PID 1.
Summary:
--register=yes → allocate a new scope through machined (the default)
--register=yes --keep-unit → use the unit we are already running in an register with machined
--register=no → allocate a new scope directly, but no machined
--register=no --keep-unit → do not allocate nor register anything
Fixes: #5823
We should try to keep the unbreakable lines below 80 columns.
It's not always possible of course.
Also, use the dl.fp.o alias instead of a specific mirror.
Add a new --pivot-root argument to systemd-nspawn, which specifies a
directory to pivot to / inside the container; while the original / is
pivoted to another specified directory (if provided). This adds
support for booting container images which may contain several bootable
sysroots, as is common with OSTree disk images. When these disk images
are booted on real hardware, ostree-prepare-root is run in conjunction
with sysroot.mount in the initramfs to achieve the same results.
This slightly extends the roothash loading logic to first check for a
user.verity.roothash extended attribute on the image file. If it exists,
it is used as Verity root hash and the ".roothash" file is not used.
This should improve the chance that the roothash is retained when the
file is moved around, as the data snippet is attached directly to the
image file. The field is still detached from the file payload however,
in order to make sure it may be trusted independently.
This does not replace the ".roothash" file loading, it simply adds a
second way to retrieve the data.
Extended attributes are often a poor choice for storing metadata like
this as it is usually difficult to discover for admins and users, and
hard to fix if it ever gets out of sync. However, in this case I think
it's safe as verity implies read-only access, and thus there's little
chance of it to get out of sync.
This adds support for a new kernel command line option "systemd.volatile=" that
provides the same functionality that systemd-nspawn's --volatile= switch
provides, but for host systems (i.e. systems booting with a kernel).
It takes the same parameter and has the same effect.
In order to implement systemd.volatile=yes a new service
systemd-volatile-root.service is introduced that only runs in the initrd and
rearranges the root directory as needed to become a tmpfs instance. Note that
systemd.volatile=state is implemented different: it simply generates a
var.mount unit file that is part of the normal boot and has no effect on the
initrd execution.
The way this is implemented ensures that other explicit configuration for /var
can always override the effect of these options. Specifically, the var.mount
unit is generated in the "late" generator directory, so that it only is in
effect if nothing else overrides it.
This extends the --bind= and --overlay= syntax so that an empty string as source/upper
directory is taken as request to automatically allocate a temporary directory
below /var/tmp, whose lifetime is bound to the nspawn runtime. In combination
with the "+" path extension this permits a switch "--overlay=+/var::/var" in
order to use the container's shipped /var, combine it with a writable temporary
directory and mount it to the runtime /var of the container.
If a source path is prefixed with "+" it is taken relative to the container's
root directory instead of the host. This permits easily establishing bind and
overlay mounts based on data from the container rather than the host.
This also reworks custom_mounts_prepare(), and turns it into two functions: one
custom_mount_check_all() that remains in nspawn.c but purely verifies the
validity of the custom mounts configured. And one called
custom_mount_prepare_all() that actually does the preparation step, sorts the
custom mounts, resolves relative paths, and allocates temporary directories as
necessary.
Given that other file systems (notably: xfs) support reflinks these days, let's
extend the file system snapshotting logic to fall back to plan copies or
reflinks when full btrfs subvolume snapshots are not available.
This essentially makes "systemd-nspawn --ephemeral" and "systemd-nspawn
--template=" available on non-btrfs subvolumes. Of course, both operations will
still be slower on non-btrfs than on btrfs (simply because reflinking each file
individually in a directory tree is still slower than doing this in one step
for a whole subvolume), but it's probably good enough for many cases, and we
should provide the users with the tools, they have to figure out what's good
for them.
Note that "machinectl clone" already had a fallback like this in place, this
patch generalizes this, and adds similar support to our other cases.
Previously --ephemeral was only supported with container trees in btrfs
subvolumes (i.e. in combination with --directory=). This adds support for
--ephemeral in conjunction with disk images (i.e. --image=) too.
As side effect this fixes that --ephemeral was accepted but ignored when using
-M on a container that turned out to be an image.
Fixes: #4664
This removes the --share-system switch: from the documentation, the --help text
as well as the command line parsing. It's an ugly option, given that it kinda
contradicts the whole concept of PID namespaces that nspawn implements. Since
it's barely ever used, let's just deprecate it and remove it from the options.
It might be useful as a debugging option, hence the functionality is kept
around for now, exposed via an undocumented $SYSTEMD_NSPAWN_SHARE_SYSTEM
environment variable.
Not as many people use chroot as before, so make the flow a bit nicer by
talking less about chroot.
"change to the either" is awkward and unclear. Just remove that part,
because all changes are lost, period.
This change documents the existance of the systemd-nspawn@.service template
unit file, which was previously not mentioned at all. Since the unit file uses
slightly different default than nspawn invoked from the command line, these
defaults are now explicitly documented too.
A couple of further additions and changes are made, too.
Replaces: #3497
This the patch implements a notificaiton mechanism from the init process
in the container to systemd-nspawn.
The switch --notify-ready=yes configures systemd-nspawn to wait the "READY=1"
message from the init process in the container to send its own to systemd.
--notify-ready=no is equivalent to the previous behavior before this patch,
systemd-nspawn notifies systemd with a "READY=1" message when the container is
created. This notificaiton mechanism uses socket file with path relative to the contanier
"/run/systemd/nspawn/notify". The default values it --notify-ready=no.
It is also possible to configure this mechanism from the .nspawn files using
NotifyReady. This parameter takes the same options of the command line switch.
Before this patch, systemd-nspawn notifies "ready" after the inner child was created,
regardless the status of the service running inside it. Now, with --notify-ready=yes,
systemd-nspawn notifies when the service is ready. This is really useful when
there are dependencies between different contaniers.
Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022
Testing:
Boot a OS inside a container with systemd-nspawn.
Note: modify the commands accordingly with your filesystem.
1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container
(You can use the example below)
2.2. systemdctl enable sleep
2.3 exit
3. sudo systemd-run --service-type=notify --unit=notify-test
${HOME}/systemd/systemd-nspawn --notify-ready=yes
-D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"
When using --notify-ready=yes the service status is "activating" for 20 seconds
before being set to "active (running)". Instead, using --notify-ready=no
the service status is marked "active (running)" quickly, without waiting for
the 20 seconds.
This patch was also test with --private-users=yes, you can test it just adding it
at the end of the command at point 3.
------ sleep.service ------
[Unit]
Description=sleep
After=network.target
[Service]
Type=oneshot
ExecStart=/bin/sleep 20
[Install]
WantedBy=multi-user.target
------------ end ------------