IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Explicitly document the behavior introduced in #7437: when picking a new
UID shift base with "-U", a hash of the machine name will be tried
before falling back to fully random UID base candidates.
The old code was only able to pass the value 0 for the inheritable
and ambient capability set when a non-root user was specified.
However, sometimes it is useful to run a program in its own container
with a user specification and some capabilities set. This is needed
when the capabilities cannot be provided by file capabilities (because
the file system is mounted with MS_NOSUID for additional security).
This commit introduces the option --ambient-capability and the config
file option AmbientCapability=. Both are used in a similar way to the
existing Capability= setting. It changes the inheritable and ambient
set (which is 0 by default). The code also checks that the settings
for the bounding set (as defined by Capability= and DropCapability=)
and the setting for the ambient set (as defined by AmbientCapability=)
are compatible. Otherwise, the operation would fail in any way.
Due to the current use of -1 to indicate no support for ambient
capability set the special value "all" cannot be supported.
Also, the setting of ambient capability is restricted to running a
single program in the container payload.
We should avoid duplicating lengthy description of very similar concepts.
--root-hash-sig follows the same semantics as RootHashSig=, so just refer
the reader to the other man page. --root-hash doesn't implement the same
features as RootHash=, so we can't fully replace the description, but let's
give the user a hint to look at the other man page too.
For #17177.
By default we'll run a container in --console=interactive and
--console=read-only mode depending if we are invoked on a tty or not so
that the container always gets a /dev/console allocated, i.e is always
suitable to run a full init system /as those typically expect a
/dev/console to exist).
With the new --console=autopipe mode we do something similar, but
slightly different: when not invoked on a tty we'll use --console=pipe.
This means, if you invoke some tool in a container with this you'll get
full inetractivity if you invoke it on a tty but things will also be
very nicely pipeable. OTOH you cannot invoke a full init system like
this, because you might or might not become a /dev/console this way...
Prompted-by: #17070
(I named this "autopipe" rather than "auto" or so, since the default
mode probably should be named "auto" one day if we add a name for it,
and this is so similar to "auto" except that it uses pipes in the
non-tty case).
Since cryptsetup 2.3.0 a new API to verify dm-verity volumes by a
pkcs7 signature, with the public key in the kernel keyring,
is available. Use it if libcryptsetup supports it.
https://tools.ietf.org/html/draft-knodel-terminology-02https://lwn.net/Articles/823224/
This gets rid of most but not occasions of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
dm-verity support in dissect-image at the moment is restricted to GPT
volumes.
If the image a single-filesystem type without a partition table (eg: squashfs)
and a roothash/verity file are passed, set the verity flag and mark as
read-only.
It's not that I think that "hostname" is vastly superior to "host name". Quite
the opposite — the difference is small, and in some context the two-word version
does fit better. But in the tree, there are ~200 occurrences of the first, and
>1600 of the other, and consistent spelling is more important than any particular
spelling choice.
The default value was described at the end of two long paragraphs.
Make the first para self contained, and move the description of --console=pipe
into the second para.
To access a shell on a disk image, the man page on Fedora-29 says to
run: `systemd-nspawn -M Fedora-Cloud-Base-28-1.1.x86_64.raw`. Let's
try.
List existing images:
$> machinectl list-images | awk '{print $1,$2}';
NAME TYPE
Fedora-Cloud-Base-30… raw
1 images
Now invoke `systemd-nspawn` as noted in the man page:
$> systemd-nspawn -M Fedora-Cloud-Base-30-1.2.x86_64.raw
No image for machine 'Fedora-Cloud-Base-30-1.2.x86_64.raw'.
Removing the ".raw" extension launches the image and gives a shell.
Update the man page to reflect that.
Frantisek Sumsal on #systemd (Freenode) noted the reason: "In older
versions systemd -M accepted both image-name.raw and image-name as a
valid image names, however, on Fedora 29 (systemd-239) with all the
BTRFS stuff around it accepts only -M image-name (without the
extension)"
- - -
While at it, update the fedora_{latest_version, cloud_release}
variables.
Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com>
No other changes, just some reshuffling and adding of section headers
(well, admittedly, I changed some "see above" and "see below" in the
text to match the new order.)
The "include" files had type "book" for some raeason. I don't think this
is meaningful. Let's just use the same everywhere.
$ perl -i -0pe 's^..DOCTYPE (book|refentry) PUBLIC "-//OASIS//DTD DocBook XML V4.[25]//EN"\s+"http^<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"\n "http^gms' man/*.xml
No need to waste space, and uniformity is good.
$ perl -i -0pe 's|\n+<!--\s*SPDX-License-Identifier: LGPL-2.1..\s*-->|\n<!-- SPDX-License-Identifier: LGPL-2.1+ -->|gms' man/*.xml
nspawn as it is now is a generally useful tool, hence let's drop the
comments about it being useful for debug and so on only.
The new wording just makes the first sentence of the main page also the
summary.
Docbook styles required those to be present, even though the templates that we
use did not show those names anywhere. But something changed semi-recently (I
would suspect docbook templates, but there was only a minor version bump in
recent years, and the changelog does not suggest anything related), and builds
now work without those entries. Let's drop this dead weight.
Tested with F26-F29, debian unstable.
$ perl -i -0pe 's/\s*<authorgroup>.*<.authorgroup>//gms' man/*xml
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Fedora 28 is out already, let's advertise it. While at it, drop "container"
from "f28container" — it's a subdirectory under /var/lib/machines, it's pretty
obvious that's it a container.
To make the switch easier in the future, define the number as an entity.
Similar as the other options added before, this is primarily useful to
provide comprehensive OCI runtime compatbility, but might be useful
otherwise, too.
This simply controls the PR_SET_NO_NEW_PRIVS flag for the container.
This too is primarily relevant to provide OCI runtime compaitiblity, but
might have other uses too, in particular as it nicely complements the
existing --capability= and --drop-capability= flags.
Previously, the container's hostname was exclusively initialized from
the machine name configured with --machine=, i.e. the internal name and
the external name used for and by the container was synchronized. This
adds a new option --hostname= that optionally allows the internal name
to deviate from the external name.
This new option is mainly useful to ultimately implement the OCI runtime
spec directly in nspawn, but it might be useful on its own for some
other usecases too.
This ensures we set the various resource limits of our container
explicitly on each invocation so that we inherit less from our callers
into the payload.
By default resource limits are now set to the same values Linux
generally passes to the host PID 1, thus minimizing needless differences
between host and container environments.
The limits are now also configurable using a new --rlimit= switch. This
is preparation for teaching nspawn native OCI runtime support as OCI
permits setting resource limits for container payloads, and it hence
probably makes sense if we do too.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
Nowadays people use systemd on many different architectures, so we
shouldn't presuppose that they are using amd64. debootstrap defaults
to the native architecture and this should be good enough.
Systemd services are permitted to be scripts, as well as binary
executables.
The same also applies to the underlying /sbin/mount and /sbin/swapon.
It is not necessary for the user to consider what type of program file
these are. Nor is it necessary with systemd-nspawn, to distinguish between
init as a "binary" v.s. a user-specified "program".
Also fix a couple of grammar nits in the modified sentences.
Add a new option `--network-namespace-path` to systemd-nspawn to allow
users to specify an arbitrary network namespace, e.g. `/run/netns/foo`.
Then systemd-nspawn will open the netns file, pass the fd to
outer_child, and enter the namespace represented by the fd before
running inner_child.
```
$ sudo ip netns add foo
$ mount | grep /run/netns/foo
nsfs on /run/netns/foo type nsfs (rw)
...
$ sudo systemd-nspawn -D /srv/fc27 --network-namespace-path=/run/netns/foo \
/bin/readlink -f /proc/self/ns/net
/proc/1/ns/net:[4026532009]
```
Note that the option `--network-namespace-path=` cannot be used together
with other network-related options such as `--private-network` so that
the options do not conflict with each other.
Fixes https://github.com/systemd/systemd/issues/7361
Let's lock things down a bit, and maintain a list of what's permitted
rather than a list of what's prohibited in nspawn (also to make things a
bit more like Docker and friends).
Note that this slightly alters the effect of --system-call-filter=, as
now the negative list now takes precedence over the positive list.
However, given that the option is just a few days old and not included
in any released version it should be fine to change it at this point in
time.
Note that the whitelist is good chunk more restrictive thatn the
previous blacklist. Specifically:
- fanotify is not permitted (given the buffer size issues it's
problematic in containers)
- nfsservctl is not permitted (NFS server support is not virtualized)
- pkey_xyz stuff is not permitted (really new stuff I don't grok)
- @cpu-emulation is prohibited (untested legacy stuff mostly, and if
people really want to run dosemu in nspawn, they should use
--system-call-filter=@cpu-emulation and all should be good)
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.
This uses similar syntax as PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.
(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)
Fixes: #5163
Replaces: #5944
Previously, only when --register=yes was set (the default) the invoked
container would get its own scope, created by machined on behalf of
nspawn. With this change if --register=no is set nspawn will still get
its own scope (which is a good thing, so that --slice= and --property=
take effect), but this is not done through machined but by registering a
scope unit directly in PID 1.
Summary:
--register=yes → allocate a new scope through machined (the default)
--register=yes --keep-unit → use the unit we are already running in an register with machined
--register=no → allocate a new scope directly, but no machined
--register=no --keep-unit → do not allocate nor register anything
Fixes: #5823
We should try to keep the unbreakable lines below 80 columns.
It's not always possible of course.
Also, use the dl.fp.o alias instead of a specific mirror.
Add a new --pivot-root argument to systemd-nspawn, which specifies a
directory to pivot to / inside the container; while the original / is
pivoted to another specified directory (if provided). This adds
support for booting container images which may contain several bootable
sysroots, as is common with OSTree disk images. When these disk images
are booted on real hardware, ostree-prepare-root is run in conjunction
with sysroot.mount in the initramfs to achieve the same results.
This slightly extends the roothash loading logic to first check for a
user.verity.roothash extended attribute on the image file. If it exists,
it is used as Verity root hash and the ".roothash" file is not used.
This should improve the chance that the roothash is retained when the
file is moved around, as the data snippet is attached directly to the
image file. The field is still detached from the file payload however,
in order to make sure it may be trusted independently.
This does not replace the ".roothash" file loading, it simply adds a
second way to retrieve the data.
Extended attributes are often a poor choice for storing metadata like
this as it is usually difficult to discover for admins and users, and
hard to fix if it ever gets out of sync. However, in this case I think
it's safe as verity implies read-only access, and thus there's little
chance of it to get out of sync.
This adds support for a new kernel command line option "systemd.volatile=" that
provides the same functionality that systemd-nspawn's --volatile= switch
provides, but for host systems (i.e. systems booting with a kernel).
It takes the same parameter and has the same effect.
In order to implement systemd.volatile=yes a new service
systemd-volatile-root.service is introduced that only runs in the initrd and
rearranges the root directory as needed to become a tmpfs instance. Note that
systemd.volatile=state is implemented different: it simply generates a
var.mount unit file that is part of the normal boot and has no effect on the
initrd execution.
The way this is implemented ensures that other explicit configuration for /var
can always override the effect of these options. Specifically, the var.mount
unit is generated in the "late" generator directory, so that it only is in
effect if nothing else overrides it.
This extends the --bind= and --overlay= syntax so that an empty string as source/upper
directory is taken as request to automatically allocate a temporary directory
below /var/tmp, whose lifetime is bound to the nspawn runtime. In combination
with the "+" path extension this permits a switch "--overlay=+/var::/var" in
order to use the container's shipped /var, combine it with a writable temporary
directory and mount it to the runtime /var of the container.
If a source path is prefixed with "+" it is taken relative to the container's
root directory instead of the host. This permits easily establishing bind and
overlay mounts based on data from the container rather than the host.
This also reworks custom_mounts_prepare(), and turns it into two functions: one
custom_mount_check_all() that remains in nspawn.c but purely verifies the
validity of the custom mounts configured. And one called
custom_mount_prepare_all() that actually does the preparation step, sorts the
custom mounts, resolves relative paths, and allocates temporary directories as
necessary.
Given that other file systems (notably: xfs) support reflinks these days, let's
extend the file system snapshotting logic to fall back to plan copies or
reflinks when full btrfs subvolume snapshots are not available.
This essentially makes "systemd-nspawn --ephemeral" and "systemd-nspawn
--template=" available on non-btrfs subvolumes. Of course, both operations will
still be slower on non-btrfs than on btrfs (simply because reflinking each file
individually in a directory tree is still slower than doing this in one step
for a whole subvolume), but it's probably good enough for many cases, and we
should provide the users with the tools, they have to figure out what's good
for them.
Note that "machinectl clone" already had a fallback like this in place, this
patch generalizes this, and adds similar support to our other cases.
Previously --ephemeral was only supported with container trees in btrfs
subvolumes (i.e. in combination with --directory=). This adds support for
--ephemeral in conjunction with disk images (i.e. --image=) too.
As side effect this fixes that --ephemeral was accepted but ignored when using
-M on a container that turned out to be an image.
Fixes: #4664
This removes the --share-system switch: from the documentation, the --help text
as well as the command line parsing. It's an ugly option, given that it kinda
contradicts the whole concept of PID namespaces that nspawn implements. Since
it's barely ever used, let's just deprecate it and remove it from the options.
It might be useful as a debugging option, hence the functionality is kept
around for now, exposed via an undocumented $SYSTEMD_NSPAWN_SHARE_SYSTEM
environment variable.