1
0
mirror of https://github.com/systemd/systemd.git synced 2024-11-06 16:59:03 +03:00
Commit Graph

2905 Commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek
3ce40911bd pid1: downgrade some rlimit warnings
Since we ignore the result anyway, downgrade errors to warning.

log_oom() will still emit an error, but that's mostly theoretical, so it
is not worth complicating the code to avoid the small inconsistency
2016-10-19 22:17:16 -04:00
Lennart Poettering
5368222db6 core: let's upgrade the log level for service processes dying of signal (#4415)
As suggested in
https://github.com/systemd/systemd/pull/4367#issuecomment-253670328
2016-10-19 19:48:35 -04:00
Luca Bruno
52c239d770 core/exec: add a named-descriptor option ("fd") for streams (#4179)
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.

This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
2016-10-17 20:05:49 -04:00
Lennart Poettering
cdc31c592a Merge pull request #4392 from keszybz/running-timers
Fix for display of elapsed timers
2016-10-17 12:58:55 +02:00
Zbigniew Jędrzejewski-Szmek
6e2c9ce1b6 core/timer: reset next_elapse_*time when timer is not waiting
When the unit that is triggered by a timer is started and running,
we transition to "running" state, and the timer will not elapse again
until the unit has finished running. In this state "systemctl list-timers"
would display the previously calculated next elapse time, which would
now of course be in the past, leading to nonsensical values.

Simply set the next elapse to infinity, which causes list-timers to
show n/a. We cannot specify when the next elapse will happen, possibly
never.

Fixes #4031.
2016-10-17 02:06:20 -04:00
Zbigniew Jędrzejewski-Szmek
ba25d39e44 pid1: do not use mtime==0 as sign of masking (#4388)
It is allowed for unit files to have an mtime==0, so instead of assuming that
any file that had mtime==0 was masked, use the load_state to filter masked
units.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1384150.
2016-10-17 07:15:03 +02:00
Zbigniew Jędrzejewski-Szmek
3b319885c4 tree-wide: introduce free_and_replace helper
It's a common pattern, so add a helper for it. A macro is necessary
because a function that takes a pointer to a pointer would be type specific,
similarly to cleanup functions. Seems better to use a macro.
2016-10-16 23:35:39 -04:00
Zbigniew Jędrzejewski-Szmek
6b430fdb7c tree-wide: use mfree more 2016-10-16 23:35:39 -04:00
Tejun Heo
7d862ab8c2 core: make settings for unified cgroup hierarchy supersede the ones for legacy hierarchy (#4269)
There are overlapping control group resource settings for the unified and
legacy hierarchies.  To help transition, the settings are translated back and
forth.  When both versions of a given setting are present, the one matching the
cgroup hierarchy type in use is used.  Unfortunately, this is more confusing to
use and document than necessary because there is no clear static precedence.

Update the translation logic so that the settings for the unified hierarchy are
always preferred.  systemd.resource-control man page is updated to reflect the
change and reorganized so that the deprecated settings are at the end in its
own section.
2016-10-14 21:07:16 -04:00
Djalal Harouni
e66a2f658b core: make sure to dump ProtectKernelModules= value 2016-10-12 14:12:17 +02:00
Djalal Harouni
4084e8fc89 core: check protect_kernel_modules and private_devices in order to setup NNP 2016-10-12 14:12:07 +02:00
Djalal Harouni
c575770b75 core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules=
Lets go further and make /lib/modules/ inaccessible for services that do
not have business with modules, this is a minor improvment but it may
help on setups with custom modules and they are limited... in regard of
kernel auto-load feature.

This change introduce NameSpaceInfo struct which we may embed later
inside ExecContext but for now lets just reduce the argument number to
setup_namespace() and merge ProtectKernelModules feature.
2016-10-12 14:11:16 +02:00
Djalal Harouni
2cd0a73547 core:sandbox: remove CAP_SYS_RAWIO on PrivateDevices=yes
The rawio system calls were filtered, but CAP_SYS_RAWIO allows to access raw
data through /proc, ioctl and some other exotic system calls...
2016-10-12 13:39:49 +02:00
Djalal Harouni
502d704e5e core:sandbox: Add ProtectKernelModules= option
This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.

This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
2016-10-12 13:31:21 +02:00
Zbigniew Jędrzejewski-Szmek
3ccb886283 Allow block and char classes in DeviceAllow bus properties (#4353)
Allowed paths are unified betwen the configuration file parses and the bus
property checker. The biggest change is that the bus code now allows "block-"
and "char-" classes. In addition, path_startswith("/dev") was used in the bus
code, and startswith("/dev") was used in the config file code. It seems
reasonable to use path_startswith() which allows a slightly broader class of
strings.

Fixes #3935.
2016-10-12 11:12:11 +02:00
0xAX
74e7579c17 core/main: get rid from excess check of ACTION_TEST (#4350)
If `--test` command line option was passed, the systemd set skip_setup
to true during bootup. But after this we check again that arg_action is
test or help and opens pager depends on result.

We should skip setup in a case when `--test` is passed, but it is also
safe to set skip_setup in a case of `--help`. So let's remove first
check and move skip_setup = true to the second check.
2016-10-11 17:30:04 -04:00
Lennart Poettering
e0d2adfde6 core: chown() any TTY used for stdin, not just when StandardInput=tty is used (#4347)
If stdin is supplied as an fd for transient units (using the
StandardInputFileDescriptor pseudo-property for transient units), then we
should also fix up the TTY ownership, not just when we opened the TTY
ourselves.

This simply drops the explicit is_terminal_input()-based check. Note that
chown_terminal() internally does a much more appropriate isatty()-based check
anyway, hence we can drop this without replacement.

Fixes: #4260
2016-10-11 14:07:22 -04:00
Zbigniew Jędrzejewski-Szmek
b744e8937c Merge pull request #4067 from poettering/invocation-id
Add an "invocation ID" concept to the service manager
2016-10-11 13:40:50 -04:00
Zbigniew Jędrzejewski-Szmek
ec72b96366 Merge pull request #4337 from poettering/exit-code
Fix for #4275 and more
2016-10-10 21:24:57 -04:00
Lennart Poettering
052364d41f core: simplify if branches a bit
We do the same thing in two branches, let's merge them. Let's also add an
explanatory comment, while we are at it.
2016-10-10 22:57:02 +02:00
Lennart Poettering
f2aed3070d core: make use of IN_SET() in various places in mount.c 2016-10-10 22:57:02 +02:00
Lennart Poettering
1f0958f640 core: when determining whether a process exit status is clean, consider whether it is a command or a daemon
SIGTERM should be considered a clean exit code for daemons (i.e. long-running
processes, as a daemon without SIGTERM handler may be shut down without issues
via SIGTERM still) while it should not be considered a clean exit code for
commands (i.e. short-running processes).

Let's add two different clean checking modes for this, and use the right one at
the appropriate places.

Fixes: #4275
2016-10-10 22:57:01 +02:00
Lennart Poettering
38107f5a4a core: lower exit status "level" at one place
When we print information about PID 1's crashdump subprocess failing. In this
case we *know* that we do not generate LSB exit codes, as it's basically PID 1
itself that exited there.
2016-10-10 22:56:55 +02:00
0xAX
f6dd106c73 main: use strdup instead of free_and_strdup to initialize default unit (#4335)
Previously we've used free_and_strdup() to fill arg_default_unit with unit
name, If we didn't pass default unit name through a kernel command line or
command line arguments. But we can use just strdup() instead of
free_and_strdup() for this, because we will start fill arg_default_unit
only if it wasn't set before.
2016-10-10 22:11:36 +02:00
Lennart Poettering
41e2036eb8 exit-status: kill is_clean_exit_lsb(), move logic to sysv-generator
Let's get rid of is_clean_exit_lsb(), let's move the logic for the special
handling of the two LSB exit codes into the sysv-generator by writing out
appropriate SuccessExitStatus= lines if the LSB header exists. This is not only
semantically more correct, bug also fixes a bug as the code in service.c that
chose between is_clean_exit_lsb() and is_clean_exit() based this check on
whether a native unit files was available for the unit. However, that check was
bogus since a long time, since the SysV generator was introduced and native
SysV script support was removed from PID 1, as in that case a unit file always
existed.
2016-10-10 21:48:08 +02:00
0xAX
c76cf844d6 tree-wide: pass return value of make_null_stdio() to warning instead of errno (#4328)
as @poettering suggested in the #4320
2016-10-10 19:51:33 +02:00
0xAX
10c961b9c9 main: initialize default unit little later (#4321)
systemd fills arg_default_unit during startup with default.target
value. But arg_default_unit may be overwritten in parse_argv() or
parse_proc_cmdline_item().

Let's check value of arg_default_unit after calls of parse_argv()
and parse_proc_cmdline_item() and fill it with default.target if
it wasn't filled before. In this way we will not spend unnecessary
time to for filling arg_default_unit with default.target.
2016-10-09 22:57:03 -04:00
0xAX
9fc932bff1 tree-wide: print warning in a failure case of make_null_stdio() (#4320)
The make_null_stdio() may fail. Let's check its result and print
warning message instead of keeping silence.
2016-10-09 22:55:24 -04:00
Lennart Poettering
4b58153dd2 core: add "invocation ID" concept to service manager
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.

The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.

The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.

The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.

Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.

This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().

PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.

A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.

Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
2016-10-07 20:14:38 +02:00
Zbigniew Jędrzejewski-Szmek
8f4d640135 core: only warn on short reads on signal fd 2016-10-07 10:05:04 -04:00
Lennart Poettering
875ca88da5 manager: tighten incoming notification message checks
Let's not accept datagrams with embedded NUL bytes. Previously we'd simply
ignore everything after the first NUL byte. But given that sending us that is
pretty ugly let's instead complain and refuse.

With this change we'll only accept messages that have exactly zero or one NUL
bytes at the very end of the datagram.
2016-10-07 12:14:33 +02:00
Lennart Poettering
045a3d5989 manager: be stricter with incomining notifications, warn properly about too large ones
Let's make the kernel let us know the full, original datagram size of the
incoming message. If it's larger than the buffer space provided by us, drop the
whole message with a warning.

Before this change the kernel would truncate the message for us to the buffer
space provided, and we'd not complain about this, and simply process the
incomplete message as far as it made sense.
2016-10-07 12:12:10 +02:00
Lennart Poettering
c55ae51e77 manager: don't ever busy loop when we get a notification message we can't process
If the kernel doesn't permit us to dequeue/process an incoming notification
datagram message it's still better to stop processing the notification messages
altogether than to enter a busy loop where we keep getting notified but can't
do a thing about it.

With this change, manager_dispatch_notify_fd() behaviour is changed like this:

- if an error indicating a spurious wake-up is seen on recvmsg(), ignore it
  (EAGAIN/EINTR)

- if any other error is seen on recvmsg() propagate it, thus disabling
  processing of further wakeups

- if any error is seen on later code in the function, warn about it but do not
  propagate it, as in this cas we're not going to busy loop as the offending
  message is already dequeued.
2016-10-07 12:08:51 +02:00
Lukáš Nykrýn
24dd31c19e core: add possibility to set action for ctrl-alt-del burst (#4105)
For some certification, it should not be possible to reboot the machine through ctrl-alt-delete. Currently we suggest our customers to mask the ctrl-alt-delete target, but that is obviously not enough.

Patching the keymaps to disable that is really not a way to go for them, because the settings need to be easily checked by some SCAP tools.
2016-10-06 21:08:21 -04:00
Lennart Poettering
97f0e76f18 user-util: rework maybe_setgroups() a bit
Let's drop the caching of the setgroups /proc field for now. While there's a
strict regime in place when it changes states, let's better not cache it since
we cannot really be sure we follow that regime correctly.

More importantly however, this is not in performance sensitive code, and
there's no indication the cache is really beneficial, hence let's drop the
caching and make things a bit simpler.

Also, while we are at it, rework the error handling a bit, and always return
negative errno-style error codes, following our usual coding style. This has
the benefit that we can sensible hanld read_one_line_file() errors, without
having to updat errno explicitly.
2016-10-06 19:04:10 +02:00
Lennart Poettering
2d6fce8d7c core: leave PAM stub process around with GIDs updated
In the process execution code of PID 1, before
096424d123 the GID settings where changed before
invoking PAM, and the UID settings after. After the change both changes are
made after the PAM session hooks are run. When invoking PAM we fork once, and
leave a stub process around which will invoke the PAM session end hooks when
the session goes away. This code previously was dropping the remaining privs
(which were precisely the UID). Fix this code to do this correctly again, by
really dropping them else (i.e. the GID as well).

While we are at it, also fix error logging of this code.

Fixes: #4238
2016-10-06 19:04:10 +02:00
Giuseppe Scrivano
36d854780c core: do not fail in a container if we can't use setgroups
It might be blocked through /proc/PID/setgroups
2016-10-06 11:49:00 +02:00
Giuseppe Scrivano
77531863ca Fix typo 2016-10-05 18:36:48 +02:00
Stefan Schweter
629ff674ac tree-wide: remove consecutive duplicate words in comments 2016-10-04 17:06:25 +02:00
Michael Olbrich
c080fbce9c automount: make sure the expire event is restarted after a daemon-reload (#4265)
If the corresponding mount unit is deserialized after the automount unit
then the expire event is set up in automount_trigger_notify(). However, if
the mount unit is deserialized first then the automount unit is still in
state AUTOMOUNT_DEAD and automount_trigger_notify() aborts without setting
up the expire event.
Explicitly call automount_start_expire() during coldplug to make sure that
the expire event is set up as necessary.

Fixes #4249.
2016-10-04 16:13:27 +02:00
Zbigniew Jędrzejewski-Szmek
a63ee40751 core: do not try to create /run/systemd/transient in test mode
This prevented systemd-analyze from unprivileged operation on older systemd
installations, which should be possible.
Also, we shouldn't touch the file system in test mode even if we can.
2016-10-01 22:53:17 +02:00
Zbigniew Jędrzejewski-Szmek
dd5e7000cb core: complain if Before= dep on .device is declared
[Unit]
Before=foobar.device

[Service]
ExecStart=/bin/true
Type=oneshot

$ systemd-analyze verify before-device.service
before-device.service: Dependency Before=foobar.device ignored (.device units cannot be delayed)
2016-10-01 22:53:17 +02:00
Zbigniew Jędrzejewski-Szmek
5fd2c135f1 core: update warning message
"closing all" might suggest that _all_ fds received with the notification message
will be closed. Reword the message to clarify that only the "unused" ones will be
closed.
2016-10-01 11:01:31 +02:00
Zbigniew Jędrzejewski-Szmek
c4bee3c40e core: get rid of unneeded state variable
No functional change.
2016-10-01 11:01:31 +02:00
Zbigniew Jędrzejewski-Szmek
a86b76753d pid1: more informative error message for ignored notifications
It's probably easier to diagnose a bad notification message if the
contents are printed. But still, do anything only if debugging is on.
2016-09-29 22:57:57 +02:00
Zbigniew Jędrzejewski-Szmek
8523bf7dd5 pid1: process zero-length notification messages again
This undoes 531ac2b234. I acked that patch without looking at the code
carefully enough. There are two problems:
- we want to process the fds anyway
- in principle empty notification messages are valid, and we should
  process them as usual, including logging using log_unit_debug().
2016-09-29 22:57:57 +02:00
Franck Bui
9987750e7a pid1: don't return any error in manager_dispatch_notify_fd() (#4240)
If manager_dispatch_notify_fd() fails and returns an error then the handling of
service notifications will be disabled entirely leading to a compromised system.

For example pid1 won't be able to receive the WATCHDOG messages anymore and
will kill all services supposed to send such messages.
2016-09-29 19:44:34 +02:00
Jorge Niedbalski
531ac2b234 If the notification message length is 0, ignore the message (#4237)
Fixes #4234.

Signed-off-by: Jorge Niedbalski <jnr@metaklass.org>
2016-09-29 05:26:16 -04:00
Evgeny Vereshchagin
cc238590e4 Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
2016-09-28 04:50:30 +03:00
Paweł Szewczyk
00bb64ecfa core: Fix USB functionfs activation and clarify its documentation (#4188)
There was no certainty about how the path in service file should look
like for usb functionfs activation. Because of this it was treated
differently in different places, which made this feature unusable.

This patch fixes the path to be the *mount directory* of functionfs, not
ep0 file path and clarifies in the documentation that ListenUSBFunction should be
the location of functionfs mount point, not ep0 file itself.
2016-09-26 18:45:47 +02:00
Djalal Harouni
8f81a5f61b core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set
Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
2016-09-25 12:52:27 +02:00
Djalal Harouni
b6c432ca7e core:namespace: simplify ProtectHome= implementation
As with previous patch simplify ProtectHome and don't care about
duplicates, they will be sorted by most restrictive mode and cleaned.
2016-09-25 12:41:16 +02:00
Djalal Harouni
f471b2afa1 core: simplify ProtectSystem= implementation
ProtectSystem= with all its different modes and other options like
PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal,
however currently it's a bit hard to parse that from the implementation
view. Simplify it by giving each mode its own table with all paths and
references to other Protect options.

With this change some entries are duplicated, but we do not care since
duplicate mounts are first sorted by the most restrictive mode then
cleaned.
2016-09-25 12:21:25 +02:00
Djalal Harouni
49accde7bd core:sandbox: add more /proc/* entries to ProtectKernelTunables=
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.

Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
2016-09-25 11:30:11 +02:00
Djalal Harouni
2652c6c103 core:namespace: simplify mount calculation
Move out mount calculation on its own function. Actually the logic is
smart enough to later drop nop and duplicates mounts, this change
improves code readability.
---
 src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)
2016-09-25 11:25:00 +02:00
Djalal Harouni
11a30cec2a core:namespace: put paths protected by ProtectKernelTunables= in
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
2016-09-25 11:16:44 +02:00
Djalal Harouni
9c94d52e09 core:namespace: minor improvements to append_mounts() 2016-09-25 11:03:21 +02:00
Lennart Poettering
cefc33aee2 execute: move SMACK setup code into its own function
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the
main execution logic a bit.
2016-09-25 10:52:57 +02:00
Lennart Poettering
cd2902c954 namespace: drop all mounts outside of the new root directory
There's no point in mounting these, if they are outside of the root directory
we'll move to.
2016-09-25 10:52:57 +02:00
Lennart Poettering
54500613a4 main: minor simplification 2016-09-25 10:52:57 +02:00
Lennart Poettering
ba128bb809 execute: filter low-level I/O syscalls if PrivateDevices= is set
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
2016-09-25 10:52:57 +02:00
Lennart Poettering
8f1ad200f0 namespace: don't make the root directory of a namespace a mount if it already is one
Let's not stack mounts needlessly.
2016-09-25 10:42:18 +02:00
Lennart Poettering
d944dc9553 namespace: chase symlinks for mounts to set up in userspace
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.

(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)

Fixes: #3867
2016-09-25 10:42:18 +02:00
Lennart Poettering
1e4e94c881 namespace: invoke unshare() only after checking all parameters
Let's create the new namespace only after we validated and processed all
parameters, right before we start with actually mounting things.

This way, the window where we can roll back is larger (not that it matters
IRL...)
2016-09-25 10:42:18 +02:00
Lennart Poettering
096424d123 execute: drop group priviliges only after setting up namespace
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
2016-09-25 10:42:18 +02:00
Lennart Poettering
63bb64a056 core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.

This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
2016-09-25 10:42:18 +02:00
Lennart Poettering
3f815163ff core: introduce ProtectSystem=strict
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.

In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.

While we are at, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for
b52a109ad3 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
2016-09-25 10:42:18 +02:00
Lennart Poettering
160cfdbed3 namespace: add some debug logging when enforcing InaccessiblePaths= 2016-09-25 10:42:18 +02:00
Lennart Poettering
6b7c9f8bce namespace: rework how ReadWritePaths= is applied
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.

This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.

With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.

This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.

This is not only the safer option, but also more in-line with what the man page
currently claims:

        "Entries (files or directories) listed in ReadWritePaths= are
        accessible from within the namespace with the same access rights as
        from outside."

To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.

A number of functions are updated to add more debug logging to make this more
digestable.
2016-09-25 10:40:51 +02:00
Lennart Poettering
7648a565d1 namespace: when enforcing fs namespace restrictions suppress redundant mounts
If /foo is marked to be read-only, and /foo/bar too, then the latter may be
suppressed as it has no effect.
2016-09-25 10:19:15 +02:00
Lennart Poettering
6ee1a919cf namespace: simplify mount_path_compare() a bit 2016-09-25 10:19:10 +02:00
Lennart Poettering
3fbe8dbe41 execute: if RuntimeDirectory= is set, it should be writable
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept
otherwise makes no sense.
2016-09-25 10:19:05 +02:00
Lennart Poettering
be39ccf3a0 execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
2016-09-25 10:18:57 +02:00
Lennart Poettering
07689d5d2c execute: split out creation of runtime dirs into its own functions 2016-09-25 10:18:54 +02:00
Lennart Poettering
fe3c2583be namespace: make sure InaccessibleDirectories= masks all mounts further down
If a dir is marked to be inaccessible then everything below it should be masked
by it.
2016-09-25 10:18:51 +02:00
Lennart Poettering
59eeb84ba6 core: add two new service settings ProtectKernelTunables= and ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
2016-09-25 10:18:48 +02:00
Lennart Poettering
72246c2a65 core: enforce seccomp for secondary archs too, for all rules
Let's make sure that all our rules apply to all archs the local kernel
supports.
2016-09-25 10:18:44 +02:00
Zbigniew Jędrzejewski-Szmek
43688c49d1 tree-wide: rename config_parse_many to …_nulstr
In preparation for adding a version which takes a strv.
2016-09-16 10:32:03 -04:00
Evgeny Vereshchagin
47af450af0 Merge pull request #4119 from keszybz/drop-more-kdbus
Drop more kdbus functionality
2016-09-10 09:26:43 +03:00
Kyle Russell
7dd736abec service: fixup ExecStop for socket-activated shutdown (#4120)
Previous fix didn't consider handling multiple ExecStop commands.
2016-09-10 08:55:36 +03:00
Michael Olbrich
0dd99f86ad unit: sent change signal before removing the unit if necessary (#4106)
If the unit is in the dbus queue when it is removed then the last change
signal is never sent. Fix this by checking the dbus queue and explicitly
send the change signal before sending the remove signal.
2016-09-09 16:05:06 +01:00
Zbigniew Jędrzejewski-Szmek
232f6754f6 pid1: drop kdbus_fd and all associated logic 2016-09-09 15:16:26 +01:00
Kyle Russell
f2dbd059a6 service: Continue shutdown on socket activated unit on termination (#4108)
ENOTCONN may be a legitimate return code if the endpoint disappeared,
but the service should still attempt to shutdown cleanly.
2016-09-09 05:34:43 +03:00
Felipe Sateler
d347d9029c seccomp: also detect if seccomp filtering is enabled
In https://github.com/systemd/systemd/pull/4004 , a runtime detection
method for seccomp was added. However, it does not detect the case
where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible
if the architecture does not support filtering yet.
Add a check for that case too.

While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl,
as that should save a few system calls and (unnecessary) allocations.
Previously, reading of /proc/self/stat was done as recommended by
prctl(2) as safer. However, given that we need to do the prctl call
anyway, lets skip opening, reading and parsing the file.

Code for checking inspired by
https://outflux.net/teach-seccomp/autodetect.html
2016-09-06 20:25:49 -03:00
Lennart Poettering
cf08b48642 core: introduce MemorySwapMax= (#3659)
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
2016-08-31 12:28:54 +02:00
Lennart Poettering
126c6aedb8 load-fragment: Resolve specifiers in OnCalendar and On*Sec (#4045)
Resolves #3534
2016-08-31 12:07:39 +02:00
WaLyong Cho
96e131ea09 core: introduce MemorySwapMax=
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
2016-08-30 11:11:45 +09:00
Barron Rulon
49915de245 mount: add SloppyOptions= to mount_dump() 2016-08-27 10:47:46 -04:00
Barron Rulon
4f8d40a9dc mount: add new ForceUnmount= setting for mount units, mapping to umount(8)'s "-f" switch 2016-08-27 10:46:52 -04:00
Douglas Christman
2507992f6b load-fragment: Resolve specifiers in OnCalendar and On*Sec
Resolves #3534
2016-08-26 12:13:16 -04:00
brulon
e520950a03 mount: add new LazyUnmount= setting for mount units, mapping to umount(8)'s "-l" switch (#3827) 2016-08-26 17:57:22 +02:00
Evgeny Vereshchagin
6afe14ff5b Merge pull request #3984 from poettering/refcnt
permit bus clients to pin units to avoid automatic GC
2016-08-26 16:17:05 +03:00
Felipe Sateler
8dec4a9d2d core,network: Use const qualifiers for block-local variables in macro functions (#4019)
Prevents discard-qualifiers warnings when the passed variable was const
2016-08-23 12:29:30 +03:00
Felipe Sateler
83f12b27d1 core: do not fail at step SECCOMP if there is no kernel support (#4004)
Fixes #3882
2016-08-22 22:40:58 +03:00
Lennart Poettering
390bc2b149 core: let's use set_contains() where appropriate 2016-08-22 16:14:21 +02:00
Lennart Poettering
fe700f46ec core: cache last CPU usage counter, before destorying a cgroup
It is useful for clients to be able to read the last CPU usage counter value of
a unit even if the unit is already terminated. Hence, before destroying a
cgroup's cgroup cache the last CPU usage counter and return it if the cgroup is
gone.
2016-08-22 16:14:21 +02:00
Lennart Poettering
05a98afd3e core: add Ref()/Unref() bus calls for units
This adds two (privileged) bus calls Ref() and Unref() to the Unit interface.
The two calls may be used by clients to pin a unit into memory, so that various
runtime properties aren't flushed out by the automatic GC. This is necessary
to permit clients to race-freely acquire runtime results (such as process exit
status/code or accumulated CPU time) on successful service termination.

Ref() and Unref() are fully recursive, hence act like the usual reference
counting concept in C. Taking a reference is a privileged operation, as this
allows pinning units into memory which consumes resources.

Transient units may also gain a reference at the time of creation, via the new
AddRef property (that is only defined for transient units at the time of
creation).
2016-08-22 16:14:21 +02:00
Zbigniew Jędrzejewski-Szmek
2056ec1927 Merge pull request #3965 from htejun/systemd-controller-on-unified 2016-08-19 19:58:01 -04:00
Lennart Poettering
16d901e251 Merge pull request #3987 from keszybz/console-color-setup
Rework console color setup
2016-08-19 19:36:09 +02:00
Lennart Poettering
cbf138ebef Merge pull request #3988 from keszybz/journald-dynamic-users
Journald dynamic users
2016-08-19 10:41:26 +02:00
Zbigniew Jędrzejewski-Szmek
61755fdae0 journald: do not create split journals for dynamic users
Dynamic users should be treated like system users, and their logs
should end up in the main system journal.
2016-08-18 23:34:40 -04:00
Zbigniew Jędrzejewski-Szmek
986a34a683 core/dynamic-users: warn when creation of symlinks for dynamic users fails
Also return the first error, since it's most likely to be interesting.
If unlink fails, symlink will usually return EEXIST.
2016-08-18 23:09:29 -04:00
Tejun Heo
f50582649f logind: update empty and "infinity" handling for [User]TasksMax (#3835)
The parsing functions for [User]TasksMax were inconsistent.  Empty string and
"infinity" were interpreted as no limit for TasksMax but not accepted for
UserTasksMax.  Update them so that they're consistent with other knobs.

* Empty string indicates the default value.
* "infinity" indicates no limit.

While at it, replace opencoded (uint64_t) -1 with CGROUP_LIMIT_MAX in TasksMax
handling.

v2: Update empty string to indicate the default value as suggested by Zbigniew
    Jędrzejewski-Szmek.

v3: Fixed empty UserTasksMax handling.
2016-08-18 22:57:53 -04:00
Zbigniew Jędrzejewski-Szmek
bd64d82c1c Revert "pid1: reconnect to the console before being re-executed"
This reverts commit affd7ed1a9.

> So it looks like make_console_stdio() has bad side effect. More specifically it
> does a TIOCSCTTY ioctl (via acquire_terminal()) which sees to disturb the
> process which was using/owning the console.

Fixes #3842.
https://bugs.debian.org/834367
https://bugzilla.redhat.com/show_bug.cgi?id=1367766
2016-08-18 22:30:15 -04:00
Zbigniew Jędrzejewski-Szmek
206fc4b284 systemd: warn when setrlimit fails
This should make it easier to figure things out.
2016-08-18 22:29:56 -04:00
Lennart Poettering
fd63e712b2 core: bypass dynamic user lookups from dbus-daemon
dbus-daemon does NSS name look-ups in order to enforce its bus policy. This
might dead-lock if an NSS module use wants to use D-Bus for the look-up itself,
like our nss-systemd does. Let's work around this by bypassing bus
communication in the NSS module if we run inside of dbus-daemon. To make this
work we keep a bit of extra state in /run/systemd/dynamic-uid/ so that we don't
have to consult the bus, but can still resolve the names.

Note that the normal codepath continues to be via the bus, so that resolving
works from all mount namespaces and is subject to authentication, as before.

This is a bit dirty, but not too dirty, as dbus daemon is kinda special anyway
for PID 1.
2016-08-19 00:50:24 +02:00
Lennart Poettering
00d9ef8560 core: add RemoveIPC= setting
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e.  all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.

This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.

In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
2016-08-19 00:37:25 +02:00
Lennart Poettering
51d73fd96a core: move obsolete properties to the end of vtables
This makes it easier to discern the relevant and obsolete parts of the vtables,
and in particular helps when comparing introspection data with the actual
vtable definitions.
2016-08-18 22:49:48 +02:00
Lennart Poettering
92b25bcabb core: make use of uid_is_valid() when checking for UID validity 2016-08-18 22:49:48 +02:00
Lennart Poettering
b4c990e91b unit: remove orphaned cgroup_netclass_id field 2016-08-18 22:49:48 +02:00
Tejun Heo
5da38d0768 core: use the unified hierarchy for the systemd cgroup controller hierarchy
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management.  Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies.  It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logics and would allow
using features which depend on the unified hierarchy membership such bpf cgroup
v2 membership test.  In time, this would also allow deleting the said
complexities.

This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.

* cg_unified(@controller) is introduced which tests whether the specific
  controller in on unified hierarchy and used to choose the unified hierarchy
  code path for process and service management when available.  Kernel
  controller specific operations remain gated by cg_all_unified().

* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
  force the use of legacy hierarchy for systemd cgroup controller.

* nspawn: By default nspawn uses the same hierarchies as the host.  If
  UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all.  If
  0, legacy for all.

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
  three options - legacy, only systemd controller on unified, and unified.  The
  value is passed into mount setup functions and controls cgroup configuration.

* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
  option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
  appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open coded integer values to indicate the
      cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy container on unified host support and fixed another bug in
    detect_unified_cgroup_hierarchy().
2016-08-17 17:44:36 -04:00
Tejun Heo
ca2f6384aa core: rename cg_unified() to cg_all_unified()
A following patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies.  This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is.  In preparation, this patch renames cg_unified() to
cg_all_unified().

This patch doesn't cause any functional changes.
2016-08-15 18:13:36 -04:00
Zbigniew Jędrzejewski-Szmek
5f9a610ad2 Merge pull request #3905 from htejun/cgroup-v2-cpu
core: add cgroup CPU controller support on the unified hierarchy

(zj: merging not squashing to make it clear against which upstream this patch was developed.)
2016-08-14 18:03:35 -04:00
Zbigniew Jędrzejewski-Szmek
87da8a864f core: amend policy to open up dynamic user queries (#3920) 2016-08-08 23:39:16 +02:00
Tejun Heo
66ebf6c0a1 core: add cgroup CPU controller support on the unified hierarchy
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches.  The situation is explained in
the following documentation.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu

While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads.  This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.

On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us".  [Startup]CPUWeight config
options are added with the usual compat translation.  CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.

v2: - Error in man page corrected.
    - CPU config application in cgroup_context_apply() refactored.
    - CPU accounting now works on unified hierarchy.
2016-08-07 09:45:39 -04:00
Zbigniew Jędrzejewski-Szmek
d87a2ef782 Merge pull request #3884 from poettering/private-users 2016-08-06 17:04:45 -04:00
Zbigniew Jędrzejewski-Szmek
3bb81a80bd Merge pull request #3818 from poettering/exit-status-env
beef up /var/tmp and /tmp handling; set $SERVICE_RESULT/$EXIT_CODE/$EXIT_STATUS on ExecStop= and make sure root/nobody are always resolvable
2016-08-05 20:55:08 -04:00
Lennart Poettering
ceab9e2dee Merge pull request #3900 from keszybz/fix-3607
Fix 3607
2016-08-05 17:03:09 +02:00
Zbigniew Jędrzejewski-Szmek
80a58668d9 socket: add helper function to remove code duplication 2016-08-05 08:24:00 -04:00
Zbigniew Jędrzejewski-Szmek
ea8f50f808 core/socket: include remote address in the message when dropping connection
Without the address the message is not very useful.

Aug 04 23:52:21 rawhide systemd[1]: testlimit.socket: Too many incoming connections (4) from source ::1, dropping connection.
2016-08-05 08:16:31 -04:00
Zbigniew Jędrzejewski-Szmek
3ebcd323bd systemd: do not serialize peer, bump count when deserializing socket instead 2016-08-05 08:16:31 -04:00
Zbigniew Jędrzejewski-Szmek
9dfb64f87d core/service: serialize and deserialize accept_socket
This fixes an issue during reexec — the count of connections would be lost:

[zbyszek@fedora-rawhide ~]$ systemctl status testlimit.socket | grep Connected
 Accepted: 1; Connected: 1

[zbyszek@fedora-rawhide ~]$ sudo systemctl daemon-reexec
[zbyszek@fedora-rawhide ~]$ systemctl status testlimit.socket | grep Connected
 Accepted: 1; Connected: 0

With the patch, Connected count is preserved.

Also add "Accept Socket" to the dump output for services.
2016-08-05 08:16:31 -04:00
Zbigniew Jędrzejewski-Szmek
166cf510c2 core/socket: rework SocketPeer refcounting
Make functions and definitions that don't need to be shared local to
socket.c.
2016-08-05 08:12:31 -04:00
Lennart Poettering
41bf0590cc util-lib: unify parsing of nice level values
This adds parse_nice() that parses a nice level and ensures it is in the right
range, via a new nice_is_valid() helper. It then ports over a number of users
to this.

No functional changes.
2016-08-05 11:18:32 +02:00
Zbigniew Jędrzejewski-Szmek
9a73653c3e systemd: convert peers_by_address to a set 2016-08-04 23:53:07 -04:00
Lennart Poettering
b08af3b127 core: only set the watchdog variables in ExecStart= lines 2016-08-04 23:08:05 +02:00
Lennart Poettering
a0fef983ab core: remember first unit failure, not last unit failure
Previously, the result value of a unit was overriden with each failure that
took place, so that the result always reported the last failure that took
place.

With this commit this is changed, so that the first failure taking place is
stored instead. This should normally not matter much as multiple failures are
sufficiently uncommon. However, it improves one behaviour: if we send SIGABRT
to a service due to a watchdog timeout, then this currently would be reported
as "coredump" failure, rather than the "watchodg" failure it really is. Hence,
in order to report information about the type of the failure, and not about
the effect of it, let's change this from all unit type to store the first, not
the last failure.

This addresses the issue pointed out here:

https://github.com/systemd/systemd/pull/3818#discussion_r73433520
2016-08-04 23:08:05 +02:00
Lennart Poettering
136dc4c435 core: set $SERVICE_RESULT, $EXIT_CODE and $EXIT_STATUS in ExecStop=/ExecStopPost= commands
This should simplify monitoring tools for services, by passing the most basic
information about service result/exit information via environment variables,
thus making it unnecessary to retrieve them explicitly via the bus.
2016-08-04 23:08:05 +02:00
0xAX
13811bf909 main: use pager for --dump-configuration-items (#3894) 2016-08-04 22:52:24 +02:00
Lennart Poettering
af9d16e10a core: use the correct APIs to determine whether a dual timestamp is initialized 2016-08-04 16:27:07 +02:00
Lennart Poettering
9c1a61adba core: move masking of chroot/permission masking into service_spawn()
Let's fix up the flags fields in service_spawn() rather than its callers, in
order to simplify things a bit.
2016-08-04 16:27:07 +02:00
Lennart Poettering
c39f1ce24d core: turn various execution flags into a proper flags parameter
The ExecParameters structure contains a number of bit-flags, that were so far
exposed as bool:1, change this to a proper, single binary bit flag field. This
makes things a bit more expressive, and is helpful as we add more flags, since
these booleans are passed around in various callers, for example
service_spawn(), whose signature can be made much shorter now.

Not all bit booleans from ExecParameters are moved into the flags field for
now, but this can be added later.
2016-08-04 16:27:07 +02:00
Lennart Poettering
eb18df724b Merge pull request #2471 from michaelolbrich/transient-mounts
allow transient mounts and automounts
2016-08-04 16:16:04 +02:00
David Michael
5124866d73 util-lib: add parse_percent_unbounded() for percentages over 100% (#3886)
This permits CPUQuota to accept greater values as documented.
2016-08-04 13:09:54 +02:00
Lennart Poettering
d251207d55 core: add new PrivateUsers= option to service execution
This setting adds minimal user namespacing support to a service. When set the invoked
processes will run in their own user namespace. Only a trivial mapping will be
set up: the root user/group is mapped to root, and the user/group of the
service will be mapped to itself, everything else is mapped to nobody.

If this setting is used the service runs with no capabilities on the host, but
configurable capabilities within the service.

This setting is particularly useful in conjunction with RootDirectory= as the
need to synchronize /etc/passwd and /etc/group between the host and the service
OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the
user of the service itself. But even outside the RootDirectory= case this
setting is useful to substantially reduce the attack surface of a service.

Example command to test this:

        systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh

This runs a shell as user "foobar". When typing "ps" only processes owned by
"root", by "foobar", and by "nobody" should be visible.
2016-08-03 20:42:04 +02:00
Lennart Poettering
7049382803 execute: don't set $SHELL and $HOME for services, if they don't contain interesting data 2016-08-03 14:52:16 +02:00
Lennart Poettering
6af760f3b2 core: inherit TERM from PID 1 for all services started on /dev/console
This way, invoking nspawn from a shell in the best case inherits the TERM
setting all the way down into the login shell spawned in the container.

Fixes: #3697
2016-08-03 14:52:16 +02:00
Lennart Poettering
43992e57e0 core: drop spurious newline 2016-08-03 14:52:16 +02:00
Susant Sahani
9d56542764 socket: add support to control no. of connections from one source (#3607)
Introduce MaxConnectionsPerSource= that is number of concurrent
connections allowed per IP.

RFE: 1939
2016-08-02 13:48:23 -04:00
Ismo Puustinen
96694e998b main: load Smack policy before IMA policy (#3859)
IMA wiki says: "If the IMA policy contains LSM labels, then the LSM
policy must be loaded prior to the IMA policy." Right now, in case of
Smack, the IMA policy is loaded before the Smack policy. Move the order
around to allow Smack labels to be used in IMA policy.
2016-08-02 08:58:30 -04:00
0xAX
494294d6f8 main: get rid of ACTION_DONE (#3849)
the ACTION_DONE was introduced in the 4288f61921 (dbus: automatically
generate and install introspection files ) commit and was used in
systemd --introspect command.

Later 'introspect' command was removed in the ca2871d9b (bus: remove
static introspection file export) commit and have no users anymore.

So we can remove it.
2016-08-01 12:38:25 +02:00
Zbigniew Jędrzejewski-Szmek
dadd6ecfa5 Merge pull request #3728 from poettering/dynamic-users 2016-07-25 16:40:26 -04:00
Michael Olbrich
87d41d6244 automount: don't cancel mount/umount request on reload/reexec (#3670)
All pending tokens are already serialized correctly and will be handled
when the mount unit is done.

Without this a 'daemon-reload' cancels all pending tokens. Any process
waiting for the mount will continue with EHOSTDOWN.
This can happen when the mount unit waits for it's dependencies, e.g.
network, devices, fsck, etc.
2016-07-25 20:04:02 +02:00
Michael Olbrich
2de0b9e913 transaction: don't cancel jobs for units with IgnoreOnIsolate=true (#3671)
This is important if a job was queued for a unit but not yet started.
Without this, the job will be canceled and is never executed even though
IgnoreOnIsolate it set to 'true'.
2016-07-25 20:02:55 +02:00
Lennart Poettering
43eb109aa9 core: change ExecStart=! syntax to ExecStart=+ (#3797)
As suggested by @mbiebl we already use the "!" special char in unit file
assignments for negation, hence we should not use it in a different context for
privileged execution. Let's use "+" instead.
2016-07-25 16:53:33 +02:00
Zbigniew Jędrzejewski-Szmek
31b14fdb6f Merge pull request #3777 from poettering/id128-rework
uuid/id128 code rework
2016-07-22 21:18:41 -04:00
Lennart Poettering
5052c4eadd Merge pull request #3753 from poettering/tasks-max-scale
Add support for relative TasksMax= specifications, and bump default for services
2016-07-22 17:40:12 +02:00
Alessandro Puccetti
0d9e799102 cgroup: whitelist inaccessible devices for "auto" and "closed" DevicePolicy.
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
2016-07-22 16:08:31 +02:00
Lennart Poettering
409093fe10 nss: add new "nss-systemd" NSS module for mapping dynamic users
With this NSS module all dynamic service users will be resolvable via NSS like
any real user.
2016-07-22 15:53:45 +02:00
Lennart Poettering
6f3e79859d core: enforce user/group name validity also when creating transient units 2016-07-22 15:53:45 +02:00
Lennart Poettering
29206d4619 core: add a concept of "dynamic" user ids, that are allocated as long as a service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).

For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.

A simple way to test this is:

        systemd-run -p DynamicUser=1 /bin/sleep 99999
2016-07-22 15:53:45 +02:00
Lennart Poettering
66dccd8d85 core: be stricter when parsing User=/Group= fields
Let's verify the validity of the syntax of the user/group names set.
2016-07-22 15:53:45 +02:00
Lennart Poettering
b3785cd5e6 core: check for overflow when handling scaled MemoryLimit= settings
Just in case...
2016-07-22 15:33:13 +02:00
Harald Hoyer
2424b6bd71 macros.systemd.in: add %systemd_ordering (#3776)
To remove the hard dependency on systemd, for packages, which function
without a running systemd the %systemd_ordering macro can be used to
ensure ordering in the rpm transaction. %systemd_ordering makes sure,
the systemd rpm is installed prior to the package, so the %pre/%post
scripts can execute the systemd parts.

Installing systemd afterwards though, does not result in the same outcome.
2016-07-22 09:33:13 -04:00
Lennart Poettering
79baeeb96d core: change TasksMax= default for system services to 15%
As it turns out 512 is max number of tasks per service is hit by too many
applications, hence let's bump it a bit, and make it relative to the system's
maximum number of PIDs. With this change the new default is 15%. At the
kernel's default pids_max value of 32768 this translates to 4915. At machined's
default TasksMax= setting of 16384 this translates to 2457.

Why 15%? Because it sounds like a round number and is close enough to 4096
which I was going for, i.e. an eight-fold increase over the old 512

Summary:

            | on the host | in a container
old default |         512 |           512
new default |        4915 |          2457
2016-07-22 15:33:13 +02:00
Lennart Poettering
84af7821b6 main: simplify things a bit by moving container check into fixup_environment() 2016-07-22 15:33:12 +02:00
Lennart Poettering
f7903e8db6 core: rename MemoryLimitByPhysicalMemory transient property to MemoryLimitScale
That way, we can neatly keep this in line with the new TasksMaxScale= option.

Note that we didn't release a version with MemoryLimitByPhysicalMemory= yet,
hence this change should be unproblematic without breaking API.
2016-07-22 15:33:12 +02:00
Lennart Poettering
83f8e80857 core: support percentage specifications on TasksMax=
This adds support for a TasksMax=40% syntax for specifying values relative to
the system's configured maximum number of processes. This is useful in order to
neatly subdivide the available room for tasks within containers.
2016-07-22 15:33:12 +02:00
Lennart Poettering
4b1afed01f core: rework machine-id-setup.c to use the calls from id128-util.[ch]
This allows us to delete quite a bit of code and make the whole thing a lot
shorter.
2016-07-22 12:59:36 +02:00
Lennart Poettering
e042eab720 main: make sure set_machine_id() doesn't clobber arg_machine_id on failure 2016-07-22 12:59:36 +02:00
Lennart Poettering
15b1248a6b machine-id-setup: port machine_id_commit() to new id128-util.c APIs 2016-07-22 12:59:36 +02:00
Lennart Poettering
910fd145f4 sd-id128: split UUID file read/write code into new id128-util.[ch]
We currently have code to read and write files containing UUIDs at various
places. Unify this in id128-util.[ch], and move some other stuff there too.

The new files are located in src/libsystemd/sd-id128/ (instead of src/shared/),
because they are actually the backend of sd_id128_get_machine() and
sd_id128_get_boot().

In follow-up patches we can use this reduce the code in nspawn and
machine-id-setup by adopted the common implementation.
2016-07-22 12:59:36 +02:00
Martin Pitt
bf3dd08a81 Merge pull request #3762 from poettering/sigkill-log
log about all processes we forcibly kill
2016-07-22 09:18:30 +02:00
Martin Pitt
5c3c778014 Merge pull request #3764 from poettering/assorted-stuff-2
Assorted fixes
2016-07-22 09:10:04 +02:00
Alessandro Puccetti
31d28eabc1 nspawn: enable major=0/minor=0 devices inside the container (#3773)
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
2016-07-21 17:39:38 +02:00
Thomas H. P. Andersen
f8298f7be3 core: remove duplicate includes (#3771) 2016-07-21 10:52:07 +02:00
Topi Miettinen
176e51b710 namespace: fix wrong return value from mount(2) (#3758)
Fix bug introduced by #3263: mount(2) return value is 0 or -1, not errno.

Thanks to Evgeny Vereshchagin (@evverx) for reporting.
2016-07-20 17:43:21 +03:00
Lennart Poettering
33df919d5c execute: make sure JoinsNamespaceOf= doesn't leak ns fds to executed processes 2016-07-20 14:53:15 +02:00
Lennart Poettering
fe048ce56a namespace: add a (void) cast 2016-07-20 14:53:15 +02:00
Lennart Poettering
9ce9347880 core: normalize header inclusion in execute.h a bit
We don't actually need any functionality from cgroup.h in execute.h, hence
don't include that. However, we do need the Unit structure from unit.h, hence
include that, and move it as late as possible, since it needs the definitions
from execute.h.
2016-07-20 14:53:15 +02:00
Lennart Poettering
7a1ab780c4 execute: normalize connect_logger_as() parameters slightly
All other functions in execute.c that need the unit id take a Unit* parameter
as first argument. Let's change connect_logger_as() to follow a similar logic.
2016-07-20 14:53:15 +02:00
Lennart Poettering
3862e809d0 core: when a scope was abandoned, always log about processes we kill
After all, if a unit is abandoned, all processes inside of it may be considered
"left over" and are something we should better log about.
2016-07-20 14:35:15 +02:00
Lennart Poettering
f4b0fb236b core: make sure RequestStop signal is send directed
This was accidentally left commented out for debugging purposes, let's fix that
and make the signal directed again.
2016-07-20 14:35:15 +02:00
Lennart Poettering
1d98fef17d core: when forcibly killing/aborting left-over unit processes log about it
Let's lot at LOG_NOTICE about any processes that we are going to
SIGKILL/SIGABRT because clean termination of them didn't work.

This turns the various boolean flag parameters to cg_kill(), cg_migrate() and
related calls into a single binary flags parameter, simply because the function
now gained even more parameters and the parameter listed shouldn't get too
long.

Logging for killing processes is done either when the kill signal is SIGABRT or
SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE
is passed. This isn't used yet in this patch, but is made use of in a later
patch.
2016-07-20 14:35:15 +02:00
Lennart Poettering
5fd7cf6fe2 namespace: minor improvements
We generally try to avoid strerror(), due to its threads-unsafety, let's do
this here, too.

Also, let's be tiny bit more explanatory with the log messages, and let's
shorten a few things.
2016-07-20 08:57:25 +02:00
Lennart Poettering
d724118e20 core: hide legacy bus properties
We usually hide legacy bus properties from introspection. Let's do that for the
InaccessibleDirectories= properties too.

The properties stay accessible if requested, but they won't be listed anymore
if people introspect the unit.
2016-07-20 08:55:50 +02:00
Alessandro Puccetti
2a624c36e6 doc,core: Read{Write,Only}Paths= and InaccessiblePaths=
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.

Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
2016-07-19 17:22:02 +02:00
Alessandro Puccetti
c4b4170746 namespace: unify limit behavior on non-directory paths
Despite the name, `Read{Write,Only}Directories=` already allows for
regular file paths to be masked. This commit adds the same behavior
to `InaccessibleDirectories=` and makes it explicit in the doc.
This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}`
{dile,device}nodes and mounts on the appropriate one the paths specified
in `InacessibleDirectories=`.

Based on Luca's patch from https://github.com/systemd/systemd/pull/3327
2016-07-19 17:22:02 +02:00
Lukáš Nykrýn
ccc2c98e1b manager: don't skip sigchld handler for main and control pid for services (#3738)
During stop when service has one "regular" pid one main pid and one
control pid and the sighld for the regular one is processed first the
unit_tidy_watch_pids will skip the main and control pid and does not
remove them from u->pids(). But then we skip the sigchld event because we
already did one in the iteration and there are two pids in u->pids.

v2: Use general unit_main_pid() and unit_control_pid() instead of
reaching directly to service structure.
2016-07-16 15:04:13 -04:00
Zbigniew Jędrzejewski-Szmek
2ed968802c tree-wide: get rid of selinux_context_t (#3732)
9eb9c93275
deprecated selinux_context_t. Replace with a simple char* everywhere.

Alternative fix for #3719.
2016-07-15 18:44:02 +02:00
Zbigniew Jędrzejewski-Szmek
1071fd0823 macros: provide %_systemdgeneratordir and %_systemdusergeneratordir (#3672)
... as requested in
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/DJ7HDNRM5JGBSA4HL3UWW5ZGLQDJ6Y7M/.
Adding the macro makes it marginally easier to create generators
for outside projects.

I opted for "generatordir" and "usergeneratordir" to match
%unitdir and %userunitdir. OTOH, "_systemd" prefix makes it obvious
that this is related to systemd. "%_generatordir" would be to generic
of a name.
2016-07-15 09:35:49 +02:00
Lennart Poettering
2e79d1828a shutdown: already sync IO before we enter the final killing spree
This way, slow IO journald has to wait for can't cause it to reach the killing
spree timeout and is hit by SIGKILL in addition to SIGTERM.
2016-07-12 17:38:19 +02:00
Lennart Poettering
d450612953 shutdown: use 90s SIGKILL timeout
There's really no reason to use 10s here, let's instead default to 90s like we
do for everything else.

The SIGKILL during the final killing spree is in most regards the fourth level
of a safety net, after all: any normal service should have already been stopped
during the normal service shutdown logic, first via SIGTERM and then SIGKILL,
and then also via SIGTERM during the finall killing spree before we send
SIGKILL. And as a fourth level safety net it should only be required in
exceptional cases, which means it's safe to rais the default timeout, as normal
shutdowns should never be delayed by it.

Note that journald excludes itself from the normal service shutdown, and relies
on the final killing spree to terminate it (this is because it wants to cover
the normal shutdown phase's complete logging). If the system's IO is
excessively slow, then the 10s might not be enough for journald to sync
everything to disk and logs might get lost during shutdown.
2016-07-12 17:32:30 +02:00
Michael Biebl
595bfe7df2 Various fixes for typos found by lintian (#3705) 2016-07-12 12:52:11 +02:00
Luca Bruno
391b81cd03 seccomp: only abort on syscall name resolution failures (#3701)
seccomp_syscall_resolve_name() can return a mix of positive and negative
(pseudo-) syscall numbers, while errors are signaled via __NR_SCMP_ERROR.
This commit lets the syscall filter parser only abort on real parsing
failures, letting libseccomp handle pseudo-syscall number on its own
and allowing proper multiplexed syscalls filtering.
2016-07-12 11:55:26 +02:00
Torstein Husebø
61233823aa treewide: fix typos and remove accidental repetition of words 2016-07-11 16:18:43 +02:00
Evgeny Vereshchagin
224d3d8266 Merge pull request #3680 from joukewitteveen/pam-env
Follow up on #3503 (pass service env vars to PAM sessions)
2016-07-08 17:33:12 +03:00
Jouke Witteveen
84eada2f7f execute: Do not alter call-by-ref parameter on failure
Prevent free from being called on (a part of) the call-by-reference
variable env when setup_pam fails.
2016-07-08 09:42:48 +02:00
David Michael
4f952a3f07 core: queue loading transient units after setting their properties (#3676)
The unit load queue can be processed in the middle of setting the
unit's properties, so its load_state would no longer be UNIT_STUB
for the check in bus_unit_set_properties(), which would cause it to
incorrectly return an error.
2016-07-08 05:43:01 +02:00
Daniel Mack
78a4ee591a cgroup: fix memory cgroup limit regression on kernel 3.10 (#3673)
Commit da4d897e ("core: add cgroup memory controller support on the unified
hierarchy (#3315)") changed the code in src/core/cgroup.c to always write
the real numeric value from the cgroup parameters to the
"memory.limit_in_bytes" attribute file.

For parameters set to CGROUP_LIMIT_MAX, this results in the string
"18446744073709551615" being written into that file, which is UINT64_MAX.
Before that commit, CGROUP_LIMIT_MAX was special-cased to the string "-1".

This causes a regression on CentOS 7, which is based on kernel 3.10, as the
value is interpreted as *signed* 64 bit, and clamped to 0:

[root@n54 ~]# echo 18446744073709551615 >/sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
[root@n54 ~]# cat /sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
0

[root@n54 ~]# echo -1 >/sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
[root@n54 ~]# cat /sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
9223372036854775807

Hence, all units that are subject to the limits enforced by the memory
controller will crash immediately, even though they have no actual limit
set. This happens to for the user.slice, for instance:

[  453.577153] Hardware name: SeaMicro SM15000-64-CC-AA-1Ox1/AMD Server CRB, BIOS Estoc.3.72.19.0018 08/19/2014
[  453.587024]  ffff880810c56780 00000000aae9501f ffff880813d7fcd0 ffffffff816360fc
[  453.594544]  ffff880813d7fd60 ffffffff8163109c ffff88080ffc5000 ffff880813d7fd28
[  453.602120]  ffffffff00000202 fffeefff00000000 0000000000000001 ffff880810c56c03
[  453.609680] Call Trace:
[  453.612156]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
[  453.617324]  [<ffffffff8163109c>] dump_header+0x8e/0x214
[  453.622671]  [<ffffffff8116d20e>] oom_kill_process+0x24e/0x3b0
[  453.628559]  [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
[  453.634969]  [<ffffffff811d4155>] mem_cgroup_oom_synchronize+0x575/0x5a0
[  453.641721]  [<ffffffff811d3520>] ? mem_cgroup_charge_common+0xc0/0xc0
[  453.648299]  [<ffffffff8116da84>] pagefault_out_of_memory+0x14/0x90
[  453.654621]  [<ffffffff8162f4cc>] mm_fault_error+0x68/0x12b
[  453.660233]  [<ffffffff81642012>] __do_page_fault+0x3e2/0x450
[  453.666017]  [<ffffffff816420a3>] do_page_fault+0x23/0x80
[  453.671467]  [<ffffffff8163e308>] page_fault+0x28/0x30
[  453.676656] Task in /user.slice/user-0.slice/user@0.service killed as a result of limit of /user.slice/user-0.slice/user@0.service
[  453.688477] memory: usage 0kB, limit 0kB, failcnt 7
[  453.693391] memory+swap: usage 0kB, limit 9007199254740991kB, failcnt 0
[  453.700039] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
[  453.706076] Memory cgroup stats for /user.slice/user-0.slice/user@0.service: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[  453.725702] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  453.733614] [ 2837]     0  2837    11950      899      23        0             0 (systemd)
[  453.741919] Memory cgroup out of memory: Kill process 2837 ((systemd)) score 1 or sacrifice child
[  453.750831] Killed process 2837 ((systemd)) total-vm:47800kB, anon-rss:3188kB, file-rss:408kB

Fix this issue by special-casing the UINT64_MAX case again.
2016-07-07 19:29:35 -07:00
Jouke Witteveen
1280503b7e execute: Cleanup the environment early
By cleaning up before setting up PAM we maintain control of overriding
behavior in setting variables. Otherwise, pam_putenv is in control.
This also makes sure we use a cleaned up environment in replacing
variables in argv.
2016-07-07 14:15:50 +02:00
Kyle Walker
1e706c8dff manager: Fixing a debug printf formatting mistake (#3640)
A 'llu' formatting statement was used in a debugging printf statement
instead of a 'PRIu64'. Correcting that mistake here.
2016-07-01 20:03:35 +03:00
Lennart Poettering
b12cc5b0f8 Merge pull request #3634 from disneyworldguy/v2sigchld
manager: Only invoke a single sigchld per unit within a cleanup cycle
2016-06-30 15:57:39 -07:00
Martin Pitt
f15461b2b2 Merge pull request #3596 from poettering/machine-clean
make "machinectl clean" asynchronous, and open it up via PolicyKit
2016-06-30 21:30:35 +02:00
Kyle Walker
36f20ae3b2 manager: Only invoke a single sigchld per unit within a cleanup cycle
By default, each iteration of manager_dispatch_sigchld() results in a unit level
sigchld event being invoked. For scope units, this results in a scope_sigchld_event()
which can seemingly stall for workloads that have a large number of PIDs within the
scope. The stall exhibits itself as a SIG_0 being initiated for each u->pids entry
as a result of pid_is_unwaited().

v2:
This patch resolves this condition by only paying to cost of a sigchld in the underlying
scope unit once per sigchld iteration. A new "sigchldgen" member resides within the
Unit struct. The Manager is incremented via the sd event loop, accessed via
sd_event_get_iteration, and the Unit member is set to the same value as the manager each
time that a sigchld event is invoked. If the Manager iteration value and Unit member
match, the sigchld event is not invoked for that iteration.
2016-06-30 15:16:47 -04:00
Franck Bui
6edefe0b06 pid1: restore console color support for containers (#3595)
Commit 3a18b60489 introduced a regression that
disabled the color mode for container.

This patch fixes this.
2016-06-24 16:08:43 +02:00
Lennart Poettering
2b40998d3c cgroup: minor coding style fix 2016-06-24 15:59:24 +02:00
Lennart Poettering
f4170c671b execute: add a new easy-to-use RestrictRealtime= option to units
It takes a boolean value. If true, access to SCHED_RR, SCHED_FIFO and
SCHED_DEADLINE is blocked, which my be used to lock up the system.
2016-06-23 01:45:45 +02:00
Lennart Poettering
abd84d4d83 execute: be a little less drastic when MemoryDenyWriteExecute= hits
Let's politely refuse with EPERM rather than kill the whole thing right-away.
2016-06-23 01:35:04 +02:00
Lennart Poettering
686d9ba614 execute: set PR_SET_NO_NEW_PRIVS also in case the exec memory protection is used
This was forgotten when MemoryDenyWriteExecute= was added: we should set NNP in
all cases when we set seccomp filters.
2016-06-23 01:33:07 +02:00