1
1
mirror of https://github.com/systemd/systemd-stable.git synced 2025-01-11 05:17:44 +03:00
Commit Graph

32386 Commits

Author SHA1 Message Date
Lennart Poettering
0b1f3c768c tree-wide: reopen log when we need to log in FORK_CLOSE_ALL_FDS children
In a number of occasions we use FORK_CLOSE_ALL_FDS when forking off a
child, since we don't want to pass fds to the processes spawned (either
because we later want to execve() some other process there, or because
our child might hang around for longer than expected, in which case it
shouldn't keep our fd pinned). This also closes any logging fds, and
thus means logging is turned off in the child. If we want to do proper
logging, explicitly reopen the logs hence in the child at the right
time.

This is particularly crucial in the umount/remount children we fork off
the shutdown binary, as otherwise the children can't log, which is
why #8155 is harder to debug than necessary: the log messages we
generate about failing mount() system calls aren't actually visible on
screen, as they done in the child processes where the log fds are
closed.
2018-02-22 00:35:00 +01:00
Lennart Poettering
e18805fbd0 shutdown: explicitly set a log target in shutdown.c
We used to set this, but this was dropped when shutdown got taught to
get the target passed in from the regular PID 1. Let's readd this to
make things more explanatory, and cover all grounds, since after all the
target passed is in theory an optional part of the protocol between the
regular PID 1 and the shutdown PID 1.
2018-02-22 00:33:12 +01:00
Lennart Poettering
d405394c5c shutdown: always pass errno to logging functions
We have them, let's propagate them.
2018-02-22 00:32:31 +01:00
Lennart Poettering
e38b8a407a log: only open kmsg on fallback if we actually want to use it
Previously, we'd try to open kmsg on failure of the journal/syslog even
if no automatic fallback to kmsg was requested — and we wouldn't even
use the open connection afterwards...
2018-02-22 00:31:36 +01:00
Lennart Poettering
00adeed99f umount: beef up logging when umount/remount child processes fail
Let's extend what we log if umount/remount doesn't work correctly as we
expect.

See #8155
2018-02-21 23:57:21 +01:00
Lennart Poettering
6079afa9b0 user-sessions: let's simplify our code paths a bit
Let's always go through mac_selinux_finish(), by making our
success/failure codepaths more alike.

This also saves a few lines of code. Yay!
2018-02-21 23:44:39 +01:00
Zbigniew Jędrzejewski-Szmek
ad383382c7 hwdb: drop bad definition for Cordless Wave Pro keyboard (#8230)
[I'm just submitting the solution originally suggested by @barzog.
Nevertheless, this looks pretty straightforward, we don't want to define
any keys on a universal receiver.

Note that this definition was added back in
aedc2eddd1, when we didn't yet have
support for figuring out what hardware is connected behind a logitech
receiver.]

In 60-keyboard.hwdb there is a definition of # Cordless Wave Pro
evdev:input:b0003v046DpC52[9B]*

which in fact not a cordless keyboard but an USB receiver to which different
types of keyboard can be connected. The solution is to completely clean
definition evdev:input:b0003v046DpC52B* from there.

I: Bus=0003 Vendor=046d Product=c52b Version=0111
N: Name="Logitech USB Receiver"
P: Phys=usb-0000:00:1d.0-1.8/input1
S: Sysfs=/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.8/4-1.8:1.1/0003:046D:C52B.0005/input/input20
U: Uniq=
H: Handlers=kbd mouse0 event8
B: PROP=0
B: EV=1f
B: KEY=3007f 0 0 83ffff17aff32d bf54444600000000 ffff0001 130f978b17c000 6773fad941dfed 9ed68000004400 10000002
B: REL=1c3
B: ABS=100000000
B: MSC=10

Fixed #8095.
2018-02-22 08:21:28 +10:00
Lennart Poettering
5128346127 bpf: reset "extra" IP accounting counters when turning off IP accounting for a unit
We maintain an "extra" set of IP accounting counters that are used when
we systemd is reloaded to carry over the counters from the previous run.
Let's reset these to zero whenever IP accounting is turned off. If we
don't do this then turning off IP accounting and back on later wouldn't
reset the counters, which is quite surprising and different from how our
CPU time counting works.
2018-02-21 16:43:36 +01:00
Lennart Poettering
aa2b6f1d2b bpf: rework how we keep track and attach cgroup bpf programs
So, the kernel's management of cgroup/BPF programs is a bit misdesigned:
if you attach a BPF program to a cgroup and close the fd for it it will
stay pinned to the cgroup with no chance of ever removing it again (or
otherwise getting ahold of it again), because the fd is used for
selecting which BPF program to detach. The only way to get rid of the
program again is to destroy the cgroup itself.

This is particularly bad for root the cgroup (and in fact any other
cgroup that we cannot realistically remove during runtime, such as
/system.slice, /init.scope or /system.slice/dbus.service) as getting rid
of the program only works by rebooting the system.

To counter this let's closely keep track to which cgroup a BPF program
is attached and let's implicitly detach the BPF program when we are
about to close the BPF fd.

This hence changes the bpf_program_cgroup_attach() function to track
where we attached the program and changes bpf_program_cgroup_detach() to
use this information. Moreover bpf_program_unref() will now implicitly
call bpf_program_cgroup_detach().

In order to simplify things, bpf_program_cgroup_attach() will now
implicitly invoke bpf_program_load_kernel() when necessary, simplifying
the caller's side.

Finally, this adds proper reference counting to BPF programs. This
is useful for working with two BPF programs in parallel: the BPF program
we are preparing for installation and the BPF program we so far
installed, shortening the window when we detach the old one and reattach
the new one.
2018-02-21 16:43:36 +01:00
Lennart Poettering
e0ad39fc52 bpf-program: make bpf_program_load_kernel() idempotent
Let's "seal" off the BPF program as soo as bpf_program_load_kernel() is
called, which allows us to make it idempotent: since the program can't
be modified anymore after being turned into a kernel object it's safe to
shortcut behaviour if called multiple times.
2018-02-21 16:43:36 +01:00
Lennart Poettering
72a1db0bb2 test: don't complain if bpffs is world-writable
Apparently, world-writable bpffs is intended by the kernel folks, hence
let's make sure we don't choke on it on our tests.
2018-02-21 16:43:36 +01:00
Lennart Poettering
13a141f046 namespace: protect bpf file system as part of ProtectKernelTunables=
It also exposes kernel objects, let's better include this in
ProtectKernelTunables=.
2018-02-21 16:43:36 +01:00
Lennart Poettering
6590080851 mount-setup: always use the same source as fstype for the API VFS we mount
So far, for all our API VFS mounts we used the fstype also as mount
source, let's do that for the cgroupsv2 mounts too. The kernel doesn't
really care about the source for API VFS, but it's visible to the user,
hence let's clean this up and follow the rule we otherwise follow.
2018-02-21 16:43:36 +01:00
Lennart Poettering
acf7f253de bpf: use BPF_F_ALLOW_MULTI flag if it is available
This new kernel 4.15 flag permits that multiple BPF programs can be
executed for each packet processed: multiple per cgroup plus all
programs defined up the tree on all parent cgroups.

We can use this for two features:

1. Finally provide per-slice IP accounting (which was previously
   unavailable)

2. Permit delegation of BPF programs to services (i.e. leaf nodes).

This patch beefs up PID1's handling of BPF to enable both.

Note two special items to keep in mind:

a. Our inner-node BPF programs (i.e. the ones we attach to slices) do
   not enforce IP access lists, that's done exclsuively in the leaf-node
   BPF programs. That's a good thing, since that way rules in leaf nodes
   can cancel out rules further up (i.e. for example to implement a
   logic of "disallow everything except httpd.service"). Inner node BPF
   programs to accounting however if that's requested. This is
   beneficial for performance reasons: it means in order to provide
   per-slice IP accounting we don't have to add up all child unit's
   data.

b. When this code is run on pre-4.15 kernel (i.e. where
   BPF_F_ALLOW_MULTI is not available) we'll make IP acocunting on slice
   units unavailable (i.e. revert to behaviour from before this commit).
   For leaf nodes we'll fallback to non-ALLOW_MULTI mode however, which
   means that BPF delegation is not available there at all, if IP
   fw/acct is turned on for the unit. This is a change from earlier
   behaviour, where we use the BPF_F_ALLOW_OVERRIDE flag, so that our
   fw/acct would lose its effect as soon as delegation was turned on and
   some client made use of that. I think the new behaviour is the safer
   choice in this case, as silent bypassing of our fw rules is not
   possible anymore. And if people want proper delegation then the way
   out is a more modern kernel or turning off IP firewalling/acct for
   the unit algother.
2018-02-21 16:43:36 +01:00
Lennart Poettering
43b7f24b5e bpf: mount bpffs by default on boot
We make heavy use of BPF functionality these days, hence expose the BPF
file system too by default now. (Note however, that we don't actually
make use bpf file systems object yet, but we might later on too.)
2018-02-21 16:43:36 +01:00
Lennart Poettering
9b3c189786 bpf-program: optionally take fd of program to detach
This is useful for BPF_F_ALLOW_MULTI programs, where the kernel requires
us to specify the fd.
2018-02-21 16:43:36 +01:00
Lennart Poettering
2ae7ee58fa bpf: beef up bpf detection, check if BPF_F_ALLOW_MULTI is supported
This improves the BPF/cgroup detection logic, and looks whether
BPF_ALLOW_MULTI is supported. This flag allows execution of multiple
BPF filters in a recursive fashion for a whole cgroup tree. It enables
us to properly report IP accounting for slice units, as well as
delegation of BPF support to units without breaking our own IP
accounting.
2018-02-21 16:43:36 +01:00
Lennart Poettering
8b15fca85b bpf: add new bpf.h header copy from 4.15 kernel 2018-02-21 16:43:36 +01:00
Yu Watanabe
9323298657 test: fix test for TemporaryFileSystem= (#8241)
This makes test-execute work on SELinux enabled systems.

Fixes the issue reported at
https://github.com/systemd/systemd/pull/7908#discussion_r169583540
2018-02-21 16:43:35 +01:00
Zbigniew Jędrzejewski-Szmek
5187dd2c40 missing_syscall: when adding syscall replacements, use different names (#8229)
In meson.build we check that functions are available using:
    meson.get_compiler('c').has_function('foo')
which checks the following:
- if __stub_foo or __stub___foo are defined, return false
- if foo is declared (a pointer to the function can be taken), return true
- otherwise check for __builtin_memfd_create

_stub is documented by glibc as
   It defines a symbol '__stub_FUNCTION' for each function
   in the C library which is a stub, meaning it will fail
   every time called, usually setting errno to ENOSYS.

So if __stub is defined, we know we don't want to use the glibc version, but
this doesn't tell us if the name itself is defined or not. If it _is_ defined,
and we define our replacement as an inline static function, we get an error:

In file included from ../src/basic/missing.h:1358:0,
                 from ../src/basic/util.h:47,
                 from ../src/basic/calendarspec.h:29,
                 from ../src/basic/calendarspec.c:34:
../src/basic/missing_syscall.h:65:19: error: static declaration of 'memfd_create' follows non-static declaration
 static inline int memfd_create(const char *name, unsigned int flags) {
                   ^~~~~~~~~~~~
.../usr/include/bits/mman-shared.h:46:5: note: previous declaration of 'memfd_create' was here
 int memfd_create (const char *__name, unsigned int __flags) __THROW;
     ^~~~~~~~~~~~

To avoid this problem, call our inline functions different than glibc,
and use a #define to map the official name to our replacement.

Fixes #8099.

v2:
- use "missing_" as the prefix instead of "_"

v3:
- rebase and update for statx()

  Unfortunately "statx" is also present in "struct statx", so the define
  causes issues. Work around this by using a typedef.

I checked that systemd compiles with current glibc
(glibc-devel-2.26-24.fc27.x86_64) if HAVE_MEMFD_CREATE, HAVE_GETTID,
HAVE_PIVOT_ROOT, HAVE_SETNS, HAVE_RENAMEAT2, HAVE_KCMP, HAVE_KEYCTL,
HAVE_COPY_FILE_RANGE, HAVE_BPF, HAVE_STATX are forced to 0.

Setting HAVE_NAME_TO_HANDLE_AT to 0 causes an issue, but it's not because of
the define, but because of struct file_handle.
2018-02-21 14:04:50 +01:00
Evgeny Vereshchagin
7b13a721f5
Merge pull request #8235 from keszybz/skip-nobody-test
Skip tests for nobody if necessary
2018-02-21 12:19:02 +03:00
Zbigniew Jędrzejewski-Szmek
7559b2da10 test-user-util: skip most tests for nobody if synthentization is off
When synthetisation is turned off, there's just too many ways those tests can
go wrong. We are not interested in verifying that the db on disk is correct,
let's just skip all checks.

In the first version of this patch, I recorded if we detected a mismatch during
configuration and only skipped tests in that case, but actually it is possible
to change the host configuration between our configuration phase and running
of the tests. It's just more robust to skip always. (This is particularly true
if tests are installed.)
2018-02-21 09:57:35 +01:00
Alan Jenkins
59e00b2a16
Merge pull request #7908 from yuwata/rfe-7895
core: add TemporaryFileSystem= setting and 'tmpfs' option to ProtectHome=
2018-02-21 08:57:11 +00:00
Evgeny Vereshchagin
24a01950a3 tests: stop using nobody in test-udev.pl (#8239)
`nobody` is a special user, whose credentials should be extracted with
`get_user_creds`. `getpwnam` called in `test-udev.pl` is a bit different,
which causes the test to fail with the following error:
```
device '/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda' expecting node/link 'node'
  expected permissions are: nobody::0600
  created permissions are : 65534:0:0600
permissions: error
add:         ok
remove:      ok
```
The ideal fix would probably be to implement `get_user_creds` in Perl, but in this
PR the issue is simply got around by using `daemon` instead of `nobody`.

Closes https://github.com/systemd/systemd/issues/8196.
2018-02-21 08:34:42 +01:00
Yu Watanabe
24743efe2d doc: update TRANSIENT-SETTINGS.md 2018-02-21 09:18:22 +09:00
Yu Watanabe
784ad252ea core: add DBus API for TemporaryFileSystem= 2018-02-21 09:18:20 +09:00
Yu Watanabe
e4da7d8c79 core: add new option 'tmpfs' to ProtectHome=
This make ProtectHome= setting can take 'tmpfs'. This is mostly
equivalent to `TemporaryFileSystem=/home /run/user /root`.
2018-02-21 09:18:17 +09:00
Yu Watanabe
4cac89bd7c test: add tests for TemporaryFileSystem= 2018-02-21 09:18:14 +09:00
Yu Watanabe
c10b460b5a man: add documents for TemporaryFileSystem= 2018-02-21 09:18:11 +09:00
Yu Watanabe
2abd4e388a core: add new setting TemporaryFileSystem=
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
2018-02-21 09:17:52 +09:00
Yu Watanabe
4ca763a902 core/namespace: make '-' prefix in Bind{,ReadOnly}Paths= work
Each path in `Bind{ReadOnly}Paths=` accept '-' prefix. However,
the prefix is completely ignored.
This makes it work as expected.
2018-02-21 09:07:56 +09:00
Yu Watanabe
72d967df3e nspawn: remove unnecessary mount option parsing logic 2018-02-21 09:06:55 +09:00
Yu Watanabe
6ef8df2ba8 mount-util: call mount_option_mangle() in mount_verbose() 2018-02-21 09:06:53 +09:00
Yu Watanabe
f27b437b4c test: add tests for mount_option_mangle() 2018-02-21 09:06:51 +09:00
Yu Watanabe
9e7f941acb mount-util: add mount_option_mangle()
This is used in the later commits.
2018-02-21 09:06:47 +09:00
Yu Watanabe
4ff4c98a39 core: simplify DBus API for BindPaths= 2018-02-21 09:06:32 +09:00
Yu Watanabe
280921f29e core: fix DBus API for AppArmorProfile= and SmackProcessLabel= 2018-02-21 09:05:40 +09:00
Yu Watanabe
8e06d57ccb core/execute: clear bind_mounts 2018-02-21 09:05:37 +09:00
Yu Watanabe
a635a7aec6 core/execute: simplify compile_bind_mounts()
It is not necessary to re-assign error code.
2018-02-21 09:05:35 +09:00
Yu Watanabe
30ffb010ff nspawn: fix indentation 2018-02-21 09:05:33 +09:00
Yu Watanabe
f5c52a7724 core/namespace: remove unused argument 2018-02-21 09:05:30 +09:00
Yu Watanabe
e282f51f57 core/namespace: use free_and_replace() 2018-02-21 09:05:21 +09:00
Yu Watanabe
55fe743273 core/namespace: fix comment 2018-02-21 09:05:18 +09:00
Yu Watanabe
89bd586cd3 core/namespace: merge PRIVATE_VAR_TMP into PRIVATE_TMP 2018-02-21 09:05:16 +09:00
Yu Watanabe
2a2969fd5d core/namespace: make arguments const if possible 2018-02-21 09:05:14 +09:00
Zbigniew Jędrzejewski-Szmek
e79d0b59c8 journalctl: improve hint about lack of access for --user-unit=...
When running journalctl --user-unit=foo as an unprivileged user we could get
the usual hint:
Hint: You are currently not seeing messages from the system and other users.
      Users in groups 'adm', 'systemd-journal', 'wheel' can see all messages.
      ...
But with --user-unit our filter is:
(((_UID=0 OR _UID=1000) AND OBJECT_SYSTEMD_USER_UNIT=foo.service) OR
 ((_UID=0 OR _UID=1000) AND COREDUMP_USER_UNIT=foo.service) OR
 (_UID=1000 AND USER_UNIT=foo.service) OR
 (_UID=1000 AND _SYSTEMD_USER_UNIT=foo.service))
so we would never see messages from other users.

We could still see messages from the system. In fact, on my machine the
only messages with OBJECT_SYSTEMD_USER_UNIT= are from the system:
journalctl  $(journalctl -F OBJECT_SYSTEMD_USER_UNIT|sed 's/.*/OBJECT_SYSTEMD_USER_UNIT=\0/')

Thus, a more correct hint is that we cannot see messages from the system.
Make it so.

Fixes #7887.
2018-02-20 22:36:01 +01:00
Zbigniew Jędrzejewski-Szmek
52c6e6a8a0 test-user-util: print function delimiters
This makes it easier to see what is going on. Crashes may happen in a
nested test_{uid,gid}_to_name_one() function, and the default backtrace
doesn't show the actual string being tested.
2018-02-20 22:10:45 +01:00
Zbigniew Jędrzejewski-Szmek
2e10cc5649
Merge pull request #8222 from poettering/journal-by-inode
make sure we detect journal rotation even on inotify q overrun
2018-02-20 21:36:25 +01:00
Zbigniew Jędrzejewski-Szmek
8f7cbe730a TODO: drop one item
C.f. 7cb609115c.
2018-02-20 17:25:05 +01:00
Lennart Poettering
4c2e1b399f xattr-util: use crtime/btime if statx() is available for implementation of fd_setcrtime() and friends
The Linux kernel exposes the birth time now for files through statx()
hence make use of it where available. We keep the xattr logic in place
for this however, since only a subset of file systems on Linux currently
expose the birth time. NFS and tmpfs for example do not support it. OTOH
there are other file systems that do support the birth time but might
not support xattrs (smb…), hence make the best of the two, in particular
in order to deal with journal files copied between file system types and
to maintain compatibility with older file systems that are updated to
newer version of the file system.
2018-02-20 15:41:49 +01:00