IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Instead of blindly creating another bind mount for read-only mounts,
check if there's already one we can use, and if so, use it. Also,
recursively mark all submounts read-only too. Also, ignore autofs mounts
when remounting read-only unless they are already triggered.
nspawn and the container child use eventfd to wait and notify each other
that they are ready so the container setup can be completed.
However in its current form the wait/notify event ignore errors that
may especially affect the child (container).
On errors the child will jump to the "child_fail" label and terminate
with _exit(EXIT_FAILURE) without notifying the parent. Since the eventfd
is created without the "EFD_NONBLOCK" flag, this leaves the parent
blocking on the eventfd_read() call. The container can also be killed
at any moment before execv() and the parent will not receive
notifications.
We can fix this by using cheap mechanisms, the new high level eventfd
API and handle SIGCHLD signals:
* Keep the cheap eventfd and EFD_NONBLOCK flag.
* Introduce eventfd states for parent and child to sync.
Child notifies parent with EVENTFD_CHILD_SUCCEEDED on success or
EVENTFD_CHILD_FAILED on failure and before _exit(). This prevents the
parent from waiting on an event that will never come.
* If the child is killed before execv() or before notifying the parent,
we install a NOP handler for SIGCHLD which will interrupt blocking calls
with EINTR. This gives a chance to the parent to call wait() and
terminate in main().
* If there are no errors, parent will block SIGCHLD, restore default
handler and notify child which will do execv(), then parent will pass
control to process_pty() to do its magic.
This was exposed in part by:
https://bugs.freedesktop.org/show_bug.cgi?id=76193
Reported-by: Tobias Hunger tobias.hunger@gmail.com
Move the container wait logic into its own wait_for_container() function
and add two status codes: CONTAINER_TERMINATED or CONTAINER_REBOOTED.
The status will be stored in its argument, this way we handle:
a) Return negative on failures.
b) Return zero on success and set the status to either
CONTAINER_REBOOTED or CONTAINER_TERMINATED.
These status codes are used to terminate nspawn or loop again in case of
CONTAINER_REBOOTED.
This undoes part of commit e6a4a517be.
Instead of removing the error message about non-empty journal bind mount
directories, simply downgrade the message to a warning and proceed.
Currently if nspawn was called with --link-journal=host or
--link-journal=auto and the right /var/log/journal/machine-id/ exists
then the bind mount the subdirectory into the container might fail due
to the ~/mycontainer/var/log/journal/machine-id/ of the container not
being empty.
There is no reason to check if the container journal subdir is empty
since there will be a bind mount on top of it. The user asked for a bind
mount so give it.
Note: a next call with --link-journal=guest may fail due to the
/var/log/journal/machine-id/ on the host not being empty.
https://bugs.freedesktop.org/show_bug.cgi?id=76193
Reported-by: Tobias Hunger <tobias.hunger@gmail.com>
Use a static table with all the typing information, rather than repeated
switch statements. This should make it a lot simpler to add new types.
We need to keep all the type info to be able to create containers
without exposing their implementation details to the users of the library.
As a freebee we verify the types of appended/read attributes.
The API is extended to nicely deal with unions of container types.
safe_close_pair() is more like safe_close(), except that it handles
pairs of fds, and doesn't make and misleading allusion, as it works
similarly well for socketpairs() as for pipe()s...
safe_close() automatically becomes a NOP when a negative fd is passed,
and returns -1 unconditionally. This makes it easy to write lines like
this:
fd = safe_close(fd);
Which will close an fd if it is open, and reset the fd variable
correctly.
By making use of this new scheme we can drop a > 200 lines of code that
was required to test for non-negative fds or to reset the closed fd
variable afterwards.
With systemd 211 nspawn attempts to create the home directory for the
given uid. However, if the home directory already exists then it will
fail. Don't error out on -EEXIST.
We still need to make sure that no two MAC addresses are the same, so we use
a logic similar to what is used in udev to generate MAC addresses, and base
it on a hash of the host's machine ID and thecontainer's name.
When the container runs a different native architecture than the host we
shouldn't attempt to load the container's NSS modules with the host's
libc. Instead, resolve UID/GID by invoking /usr/bin/getent in the
container. The tool should be fairly universally available and allows us
to do resolving of the UID/GID with the container's libc in a parsable
format.
https://bugs.freedesktop.org/show_bug.cgi?id=75733
We overmount /dev/console with an external pty anyway, hence there's no
point in using the real major/minor when we create the node to
overmount. Instead, use the one of /dev/null now.
This fixes a race against the cgroup device controller setup we are
using. In case /dev/console was create before the cgroup policy was
applied all was good, but if created in the opposite order the mknod()
would fail, since creating /dev/console is not allowed by it. Creating
/dev/null instances is however permitted, and hence use it.
Running 'systemd-nspawn -D /srv/Fedora/' gave me this error:
Failed to read /proc/self/loginuid: No such file or directory
Container Fedora failed with error code 1.
This patch fixes the problem.
Previously the returned object of constructor functions where sometimes
returned as last, sometimes as first and sometimes as second parameter.
Let's clean this up a bit. Here are the new rules:
1. The object the new object is derived from is put first, if there is any
2. The object we are creating will be returned in the next arguments
3. This is followed by any additional arguments
Rationale:
For functions that operate on an object we always put that object first.
Constructors should probably not be too different in this regard. Also,
if the additional parameters might want to use varargs which suggests to
put them last.
Note that this new scheme only applies to constructor functions, not to
all other functions. We do give a lot of freedom for those.
Note that this commit only changes the order of the new functions we
added, for old ones we accept the wrong order and leave it like that.
If -flto is used then gcc will generate a lot more warnings than before,
among them a number of use-without-initialization warnings. Most of them
without are false positives, but let's make them go away, because it
doesn't really matter.
Arch Linux uses nspawn as a container for building packages and needs
to be able to start a 32bit chroot from a 64bit host. 24fb111207
disrupted this feature when seccomp handling was added.
This adds the host side of the veth link to the given bridge.
Also refactor the creation of the veth interfaces a bit to set it up
from the host rather than the container. This simplifies the addition
to the bridge, but otherwise the behavior is unchanged.
The kernel still doesn't support audit in containers, so let's make use
of seccomp and simply turn it off entirely. We can get rid of this big
as soon as the kernel is fixed again.
So far we followed the rule to always indicate the "flavour" of
constructors after the "_new_" or "_open_" in the function name, so
let's keep things in sync here for rtnl and do the same.
The "sd_" prefix is supposed to be used on exported symbols only, and
not in the middle of names. Let's drop it from the cleanup macros hence,
to make things simpler.
The bus cleanup macros don't carry the "sd_" either, so this brings the
APIs a bit nearer.
Let's always call the security labels the same way:
SMACK: "Smack Label"
SELINUX: "SELinux Security Context"
And the low-level encapsulation is called "seclabel". Now let's hope we
stick to this vocabulary in future, too, and don't mix "label"s and
"security contexts" and so on wildly.
- As suggested, prefix argument variables with "arg_" how we do this
usually.
- As suggested, don't involve memory allocations when storing command
line arguments.
- Break --help text at 80 chars
- man: explain that this is about SELinux
- don't do unnecessary memory allocations when putting together mount
option string
This patch adds to new options:
-Z PROCESS_LABEL
This specifies the process label to run on processes run within the container.
-L FILE_LABEL
The file label to assign to memory file systems created within the container.
For example if you wanted to wrap an container with SELinux sandbox labels, you could execute a command line the following
chcon system_u:object_r:svirt_sandbox_file_t:s0:c0,c1 -R /srv/container
systemd-nspawn -L system_u:object_r:svirt_sandbox_file_t:s0:c0,c1 -Z system_u:system_r:svirt_lxc_net_t:s0:c0,c1 -D /srv/container /bin/sh
Similar to PrivateNetwork=, PrivateTmp= introduce PrivateDevices= that
sets up a private /dev with only the API pseudo-devices like /dev/null,
/dev/zero, /dev/random, but not any physical devices in them.
On kdbus user credentials are not translated across PID namespaces, but
simply invalidated if sender and receiver namespaces don't match. This
makes it impossible to properly authenticate requests from different PID
namespaces (which is probably a good thing). Hence, register the machine
in the parent and not the client and properly synchronize this.
If --link-journal=host or --link-journal=guest is used, this totally
cannot work and we exit with an error. If however --link-journal=auto
or --link-journal=no is used, just display a warning.
Having the same machine id can happen if booting from the same
filesystem as the host. Since other things mostly function correctly,
let's allow that.
https://bugs.freedesktop.org/show_bug.cgi?id=68369
The only problem is that libgen.h #defines basename to point to it's
own broken implementation instead of the GNU one. This can be fixed
by #undefining basename.
We want to emphasize bus connections as per-thread communication
primitives, hence introduce a concept of a per-thread default bus, and
make use of it everywhere.
Among other things this also adds a few things necessary for the change:
- Considerably more powerful error returning APIs in libsystemd-bus
- Adapter for connecting an sd_bus to an sd_event
- As I reworked the PolicyKit logic to the new library I also made it
asynchronous, so that PolicyKit requests of one user cannot block out
another user anymore.
- We always use the macro names for common bus error. That way it is
harder to mistype them since the compiler will notice
We were already creating the file if it was missing, and this way
containers can reconfigure the file without running into problems.
This also makes resolv.conf handling more alike to handling of
/etc/localtime, which is also not a bind mount.
Previously, if a file's bind mount destination didn't exist, nspawn
would blindly create a directory, and the subsequent bind mount would
fail. Examine the filetype of the source and ensure that, if the
destination does not exist, that it is created appropriately.
Also go one step further and ensure that the filetypes of the source
and destination match.
Commit 2e996f4d4b added an include
of linux/netlink.h
This kernel header is not self contained in the linux 2.6 kernel
which breaks compilation with an unknown type sa_family_t
A workaround is to include linux/netlink.h after sys/socket.h
Embedded folks don't need the machine registration stuff, hence it's
nice to make this optional. Also, I'd expect that machinectl will grow
additional commands quickly, for example to join existing containers and
suchlike, hence it's better keeping that separate from loginctl.
- This changes all logind cgroup objects to use slice objects rather
than fixed croup locations.
- logind can now collect minimal information about running
VMs/containers. As fixed cgroup locations can no longer be used we
need an entity that keeps track of machine cgroups in whatever slice
they might be located. Since logind already keeps track of users,
sessions and seats this is a trivial addition.
- nspawn will now register with logind and pass various bits of metadata
along. A new option "--slice=" has been added to place the container
in a specific slice.
- loginctl gained commands to list, introspect and terminate machines.
- user.slice and machine.slice will now be pulled in by logind.service,
since only logind.service requires this slice.
Session objects will now get the .session suffix, user objects the .user
suffix, nspawn containers the .nspawn suffix.
This also changes the user cgroups to be named after the numeric UID
rather than the username, since this allows us the parse these paths
standalone without requiring access to the cgroup file system.
This also changes the mapping of instanced units to cgroups. Instead of
mapping foo@bar.service to the cgroup path /user/foo@.service/bar we
will now map it to /user/foo@.service/foo@bar.service, in order to
ensure that all our objects are properly suffixed in the tree.
As discussed with Dan Berrange it's a good idea to suffix all objects in
the cgroup tree with ".something", so that when the system is
partitioned using a resource management tool we can drop objects of
different types into the same partition directory without generate
namespace conflicts.
We'l add this to the Pax Control Group document as soon as write access
to the fdo wiki is restored.
All attributes are stored as text, since root_directory is already
text, and it seems easier to have all of them in text format.
Attributes are written in the trusted. namespace, because the kernel
currently does not allow user. attributes on cgroups. This is a PITA,
and CAP_SYS_ADMIN is required to *read* the attributes. Alas.
A second pipe is opened for the child to signal the parent that the
cgroup hierarchy has been set up.
nspawn will overmount resolv.conf if it exists. Since e.g.
default install with yum doesn't create /etc/resolv.conf,
a container created with yum will not have network. This
seems undesirable, and since we overmount the file anyway,
let's create it too.
Also, mounting a read-write /etc/resolv.conf in the container
is treated as a failure, since it makes it possible to
modify hosts /etc/resolv.conf from inside the container.
Containers will now carry a label (normally derived from the root
directory name, but configurable by the user), and the container's root
cgroup is /machine/<label>. This label is called "machine name", and can
cover both containers and VMs (as soon as libvirt also makes use of
/machine/).
libsystemd-login can be used to query the machine name from a process.
This patch also includes numerous clean-ups for the cgroup code.
Before, we would initialize many fields twice: first
by filling the structure with zeros, and then a second
time with the real values. We can let the compiler do
the job for us, avoiding one copy.
A downside of this patch is that text gets slightly
bigger. This is because all zero() calls are effectively
inlined:
$ size build/.libs/systemd
text data bss dec hex filename
before 897737 107300 2560 1007597 f5fed build/.libs/systemd
after 897873 107300 2560 1007733 f6075 build/.libs/systemd
… actually less than 1‰.
A few asserts that the parameter is not null had to be removed. I
don't think this changes much, because first, it is quite unlikely
for the assert to fail, and second, an immediate SEGV is almost as
good as an assert.
This reverts commit cb96a2c69a.
It is not a mistake to pass args when -b is specified. They will simply
be passed on to the container's init.
The manpage needs fixing, that's true.
systemd-nspawn will now print the PID of the child.
An example showing how to enter the container is added
to the man page.
Support for nsenter without an explicit command was
added in https://github.com/karelzak/util-linux/commit/5758069
(post v2.22.2). So this example requires both a new kernel
and the latest util-linux.
stdout can be redirected to a regular file. Regular files don't support epoll.
nspawn failed with: "Failed to register fds in epoll: Operation not permitted".
If stdout does not support epoll, assume it's always writable.
Due to the brokeness of much of the userspace audit code we cannot
really start too many systems without the audit caps set. To make nspawn
easier to use just add the audit caps by default.
To boot up containers successfully the kernel's auditing needs to be
turned off still (use "audit=0" on the kernel command line), but at
least no manual caps have to be passed anymore.
In the long run auditing will be fixed for containers and ve virtualized
properly at which time it should be safe to enable these caps anyway.
Most things seem to function fine without /dev/shm, but it is expected
to be there (quoting linux/Documentation/filesystems/tmpfs.txt:
glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for POSIX
shared memory (shm_open, shm_unlink)).
Since /tmp/ is already mounted as tmpfs, it would be enough to mkdir
/tmp/shm and chmod it. Mounting it separately has the advantage that
it can be easily remounted to change the quota.
This creates /dev/fd, /dev/stdin, /dev/stdout, /dev/stderr, and
/dev/core as symlinks to /proc on container creation. Except for
/dev/core, these are needed for shells like bash to be fully functional.
also a number of minor fixups and bug fixes: spelling, oom errors
that didn't print errors, not properly forwarding error codes,
few more consistency issues, et cetera
glibc/glib both use "out of memory" consistantly so maybe we should
consider that instead of this.
Eliminates one string out of a number of binaries. Also fixes extra newline
in udev/scsi_id
This also ensures that caps dropped from the bounding set are also
dropped from the inheritable set, to be extra-secure. Usually that should
change very little though as the inheritable set is empty for all our uses
anyway.