IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
So far, for all our API VFS mounts we used the fstype also as mount
source, let's do that for the cgroupsv2 mounts too. The kernel doesn't
really care about the source for API VFS, but it's visible to the user,
hence let's clean this up and follow the rule we otherwise follow.
This new kernel 4.15 flag permits that multiple BPF programs can be
executed for each packet processed: multiple per cgroup plus all
programs defined up the tree on all parent cgroups.
We can use this for two features:
1. Finally provide per-slice IP accounting (which was previously
unavailable)
2. Permit delegation of BPF programs to services (i.e. leaf nodes).
This patch beefs up PID1's handling of BPF to enable both.
Note two special items to keep in mind:
a. Our inner-node BPF programs (i.e. the ones we attach to slices) do
not enforce IP access lists, that's done exclsuively in the leaf-node
BPF programs. That's a good thing, since that way rules in leaf nodes
can cancel out rules further up (i.e. for example to implement a
logic of "disallow everything except httpd.service"). Inner node BPF
programs to accounting however if that's requested. This is
beneficial for performance reasons: it means in order to provide
per-slice IP accounting we don't have to add up all child unit's
data.
b. When this code is run on pre-4.15 kernel (i.e. where
BPF_F_ALLOW_MULTI is not available) we'll make IP acocunting on slice
units unavailable (i.e. revert to behaviour from before this commit).
For leaf nodes we'll fallback to non-ALLOW_MULTI mode however, which
means that BPF delegation is not available there at all, if IP
fw/acct is turned on for the unit. This is a change from earlier
behaviour, where we use the BPF_F_ALLOW_OVERRIDE flag, so that our
fw/acct would lose its effect as soon as delegation was turned on and
some client made use of that. I think the new behaviour is the safer
choice in this case, as silent bypassing of our fw rules is not
possible anymore. And if people want proper delegation then the way
out is a more modern kernel or turning off IP firewalling/acct for
the unit algother.
We make heavy use of BPF functionality these days, hence expose the BPF
file system too by default now. (Note however, that we don't actually
make use bpf file systems object yet, but we might later on too.)
This improves the BPF/cgroup detection logic, and looks whether
BPF_ALLOW_MULTI is supported. This flag allows execution of multiple
BPF filters in a recursive fashion for a whole cgroup tree. It enables
us to properly report IP accounting for slice units, as well as
delegation of BPF support to units without breaking our own IP
accounting.
In meson.build we check that functions are available using:
meson.get_compiler('c').has_function('foo')
which checks the following:
- if __stub_foo or __stub___foo are defined, return false
- if foo is declared (a pointer to the function can be taken), return true
- otherwise check for __builtin_memfd_create
_stub is documented by glibc as
It defines a symbol '__stub_FUNCTION' for each function
in the C library which is a stub, meaning it will fail
every time called, usually setting errno to ENOSYS.
So if __stub is defined, we know we don't want to use the glibc version, but
this doesn't tell us if the name itself is defined or not. If it _is_ defined,
and we define our replacement as an inline static function, we get an error:
In file included from ../src/basic/missing.h:1358:0,
from ../src/basic/util.h:47,
from ../src/basic/calendarspec.h:29,
from ../src/basic/calendarspec.c:34:
../src/basic/missing_syscall.h:65:19: error: static declaration of 'memfd_create' follows non-static declaration
static inline int memfd_create(const char *name, unsigned int flags) {
^~~~~~~~~~~~
.../usr/include/bits/mman-shared.h:46:5: note: previous declaration of 'memfd_create' was here
int memfd_create (const char *__name, unsigned int __flags) __THROW;
^~~~~~~~~~~~
To avoid this problem, call our inline functions different than glibc,
and use a #define to map the official name to our replacement.
Fixes#8099.
v2:
- use "missing_" as the prefix instead of "_"
v3:
- rebase and update for statx()
Unfortunately "statx" is also present in "struct statx", so the define
causes issues. Work around this by using a typedef.
I checked that systemd compiles with current glibc
(glibc-devel-2.26-24.fc27.x86_64) if HAVE_MEMFD_CREATE, HAVE_GETTID,
HAVE_PIVOT_ROOT, HAVE_SETNS, HAVE_RENAMEAT2, HAVE_KCMP, HAVE_KEYCTL,
HAVE_COPY_FILE_RANGE, HAVE_BPF, HAVE_STATX are forced to 0.
Setting HAVE_NAME_TO_HANDLE_AT to 0 causes an issue, but it's not because of
the define, but because of struct file_handle.
When synthetisation is turned off, there's just too many ways those tests can
go wrong. We are not interested in verifying that the db on disk is correct,
let's just skip all checks.
In the first version of this patch, I recorded if we detected a mismatch during
configuration and only skipped tests in that case, but actually it is possible
to change the host configuration between our configuration phase and running
of the tests. It's just more robust to skip always. (This is particularly true
if tests are installed.)
`nobody` is a special user, whose credentials should be extracted with
`get_user_creds`. `getpwnam` called in `test-udev.pl` is a bit different,
which causes the test to fail with the following error:
```
device '/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda' expecting node/link 'node'
expected permissions are: nobody::0600
created permissions are : 65534:0:0600
permissions: error
add: ok
remove: ok
```
The ideal fix would probably be to implement `get_user_creds` in Perl, but in this
PR the issue is simply got around by using `daemon` instead of `nobody`.
Closes https://github.com/systemd/systemd/issues/8196.
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
When running journalctl --user-unit=foo as an unprivileged user we could get
the usual hint:
Hint: You are currently not seeing messages from the system and other users.
Users in groups 'adm', 'systemd-journal', 'wheel' can see all messages.
...
But with --user-unit our filter is:
(((_UID=0 OR _UID=1000) AND OBJECT_SYSTEMD_USER_UNIT=foo.service) OR
((_UID=0 OR _UID=1000) AND COREDUMP_USER_UNIT=foo.service) OR
(_UID=1000 AND USER_UNIT=foo.service) OR
(_UID=1000 AND _SYSTEMD_USER_UNIT=foo.service))
so we would never see messages from other users.
We could still see messages from the system. In fact, on my machine the
only messages with OBJECT_SYSTEMD_USER_UNIT= are from the system:
journalctl $(journalctl -F OBJECT_SYSTEMD_USER_UNIT|sed 's/.*/OBJECT_SYSTEMD_USER_UNIT=\0/')
Thus, a more correct hint is that we cannot see messages from the system.
Make it so.
Fixes#7887.
This makes it easier to see what is going on. Crashes may happen in a
nested test_{uid,gid}_to_name_one() function, and the default backtrace
doesn't show the actual string being tested.
The Linux kernel exposes the birth time now for files through statx()
hence make use of it where available. We keep the xattr logic in place
for this however, since only a subset of file systems on Linux currently
expose the birth time. NFS and tmpfs for example do not support it. OTOH
there are other file systems that do support the birth time but might
not support xattrs (smb…), hence make the best of the two, in particular
in order to deal with journal files copied between file system types and
to maintain compatibility with older file systems that are updated to
newer version of the file system.
Let's make use this at various places we call fsync(), to make things
fully reliable, as the kernel devs suggest to first fsync() files and
then fsync() the directories they are located in.
Let's add a common implementation for regular file checks, that are
careful to return the right error code (EISDIR/EISLNK/EBADFD) when we
are encountering a wrong file node.
Let's make sure we aren't confused if a journal file is replaced by a
different one (for example due to rotation) if we are in a q overflow:
let's compare the inode/device information, and if it changed replace
any open file object as needed.
Fixes: #8198
Let's be more careful with the naming, and indicate that the function
is about *named* journal files, and will validate the name as needed.
(in opposition to add_any_file() which doesn't care about names)
Before this change all unit types would default to "private" in the
system service manager and "inherit" to in the user service manager.
With this change this is slightly altered: non-service units of the
system service manager are now run with KeyringMode=shared. This appears
to be the more appropriate choice as isolation is not as desirable for
mount tools, which regularly consume key material. After all mounts are
a shared resource themselves as they appear system-wide hence it makes a
lot of sense to share their key material too.
Fixes: #8159