systemd

mirror of https://github.com/systemd/systemd.git synced 2024-11-06 16:59:03 +03:00

Author	SHA1	Message	Date
Zbigniew Jędrzejewski-Szmek	3ce40911bd	pid1: downgrade some rlimit warnings Since we ignore the result anyway, downgrade errors to warning. log_oom() will still emit an error, but that's mostly theoretical, so it is not worth complicating the code to avoid the small inconsistency	2016-10-19 22:17:16 -04:00
Lennart Poettering	5368222db6	core: let's upgrade the log level for service processes dying of signal (#4415) As suggested in https://github.com/systemd/systemd/pull/4367#issuecomment-253670328	2016-10-19 19:48:35 -04:00
Luca Bruno	52c239d770	core/exec: add a named-descriptor option ("fd") for streams (#4179) This commit adds a `fd` option to `StandardInput=`, `StandardOutput=` and `StandardError=` properties in order to connect standard streams to externally named descriptors provided by some socket units. This option looks for a file descriptor named as the corresponding stream. Custom names can be specified, separated by a colon. If multiple name-matches exist, the first matching fd will be used.	2016-10-17 20:05:49 -04:00
Lennart Poettering	cdc31c592a	Merge pull request #4392 from keszybz/running-timers Fix for display of elapsed timers	2016-10-17 12:58:55 +02:00
Zbigniew Jędrzejewski-Szmek	6e2c9ce1b6	core/timer: reset next_elapse_*time when timer is not waiting When the unit that is triggered by a timer is started and running, we transition to "running" state, and the timer will not elapse again until the unit has finished running. In this state "systemctl list-timers" would display the previously calculated next elapse time, which would now of course be in the past, leading to nonsensical values. Simply set the next elapse to infinity, which causes list-timers to show n/a. We cannot specify when the next elapse will happen, possibly never. Fixes #4031.	2016-10-17 02:06:20 -04:00
Zbigniew Jędrzejewski-Szmek	ba25d39e44	pid1: do not use mtime==0 as sign of masking (#4388) It is allowed for unit files to have an mtime==0, so instead of assuming that any file that had mtime==0 was masked, use the load_state to filter masked units. Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1384150.	2016-10-17 07:15:03 +02:00
Zbigniew Jędrzejewski-Szmek	3b319885c4	tree-wide: introduce free_and_replace helper It's a common pattern, so add a helper for it. A macro is necessary because a function that takes a pointer to a pointer would be type specific, similarly to cleanup functions. Seems better to use a macro.	2016-10-16 23:35:39 -04:00
Zbigniew Jędrzejewski-Szmek	6b430fdb7c	tree-wide: use mfree more	2016-10-16 23:35:39 -04:00
Tejun Heo	7d862ab8c2	core: make settings for unified cgroup hierarchy supersede the ones for legacy hierarchy (#4269) There are overlapping control group resource settings for the unified and legacy hierarchies. To help transition, the settings are translated back and forth. When both versions of a given setting are present, the one matching the cgroup hierarchy type in use is used. Unfortunately, this is more confusing to use and document than necessary because there is no clear static precedence. Update the translation logic so that the settings for the unified hierarchy are always preferred. The systemd.resource-control man page is updated to reflect the change and reorganized so that the deprecated settings are at the end in their own section.	2016-10-14 21:07:16 -04:00
Djalal Harouni	e66a2f658b	core: make sure to dump ProtectKernelModules= value	2016-10-12 14:12:17 +02:00
Djalal Harouni	4084e8fc89	core: check protect_kernel_modules and private_devices in order to setup NNP	2016-10-12 14:12:07 +02:00
Djalal Harouni	c575770b75	core:sandbox: let's make /lib/modules/ inaccessible on ProtectKernelModules= Let's go further and make /lib/modules/ inaccessible for services that have no business with modules. This is a minor improvement, but it may help on setups with custom modules, and they are limited with regard to the kernel auto-load feature. This change introduces the NameSpaceInfo struct, which we may embed later inside ExecContext, but for now let's just reduce the number of arguments to setup_namespace() and merge the ProtectKernelModules feature.	2016-10-12 14:11:16 +02:00
Djalal Harouni	2cd0a73547	core:sandbox: remove CAP_SYS_RAWIO on PrivateDevices=yes The rawio system calls were filtered, but CAP_SYS_RAWIO allows access to raw data through /proc, ioctl and some other exotic system calls...	2016-10-12 13:39:49 +02:00
Djalal Harouni	502d704e5e	core:sandbox: Add ProtectKernelModules= option This is useful to turn off explicit module load and unload operations on modular kernels. This option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls. This option will not prevent the kernel from loading modules using the module auto-load feature which is a system wide operation.	2016-10-12 13:31:21 +02:00
Zbigniew Jędrzejewski-Szmek	3ccb886283	Allow block and char classes in DeviceAllow bus properties (#4353) Allowed paths are unified between the configuration file parser and the bus property checker. The biggest change is that the bus code now allows "block-" and "char-" classes. In addition, path_startswith("/dev") was used in the bus code, and startswith("/dev") was used in the config file code. It seems reasonable to use path_startswith() which allows a slightly broader class of strings. Fixes #3935.	2016-10-12 11:12:11 +02:00
0xAX	74e7579c17	core/main: get rid of excess check of ACTION_TEST (#4350) If the `--test` command line option was passed, systemd set skip_setup to true during bootup. But after this we check again whether arg_action is test or help and open the pager depending on the result. We should skip setup in the case when `--test` is passed, but it is also safe to set skip_setup in the case of `--help`. So let's remove the first check and move skip_setup = true to the second check.	2016-10-11 17:30:04 -04:00
Lennart Poettering	e0d2adfde6	core: chown() any TTY used for stdin, not just when StandardInput=tty is used (#4347) If stdin is supplied as an fd for transient units (using the StandardInputFileDescriptor pseudo-property for transient units), then we should also fix up the TTY ownership, not just when we opened the TTY ourselves. This simply drops the explicit is_terminal_input()-based check. Note that chown_terminal() internally does a much more appropriate isatty()-based check anyway, hence we can drop this without replacement. Fixes: #4260	2016-10-11 14:07:22 -04:00
Zbigniew Jędrzejewski-Szmek	b744e8937c	Merge pull request #4067 from poettering/invocation-id Add an "invocation ID" concept to the service manager	2016-10-11 13:40:50 -04:00
Zbigniew Jędrzejewski-Szmek	ec72b96366	Merge pull request #4337 from poettering/exit-code Fix for #4275 and more	2016-10-10 21:24:57 -04:00
Lennart Poettering	052364d41f	core: simplify if branches a bit We do the same thing in two branches, let's merge them. Let's also add an explanatory comment, while we are at it.	2016-10-10 22:57:02 +02:00
Lennart Poettering	f2aed3070d	core: make use of IN_SET() in various places in mount.c	2016-10-10 22:57:02 +02:00
Lennart Poettering	1f0958f640	core: when determining whether a process exit status is clean, consider whether it is a command or a daemon SIGTERM should be considered a clean exit code for daemons (i.e. long-running processes, as a daemon without SIGTERM handler may be shut down without issues via SIGTERM still) while it should not be considered a clean exit code for commands (i.e. short-running processes). Let's add two different clean checking modes for this, and use the right one at the appropriate places. Fixes: #4275	2016-10-10 22:57:01 +02:00
Lennart Poettering	38107f5a4a	core: lower exit status "level" at one place When we print information about PID 1's crashdump subprocess failing. In this case we know that we do not generate LSB exit codes, as it's basically PID 1 itself that exited there.	2016-10-10 22:56:55 +02:00
0xAX	f6dd106c73	main: use strdup instead of free_and_strdup to initialize default unit (#4335) Previously we used free_and_strdup() to fill arg_default_unit with the unit name if we didn't pass a default unit name through the kernel command line or command line arguments. But we can just use strdup() instead of free_and_strdup() for this, because we only start filling arg_default_unit if it wasn't set before.	2016-10-10 22:11:36 +02:00
Lennart Poettering	41e2036eb8	exit-status: kill is_clean_exit_lsb(), move logic to sysv-generator Let's get rid of is_clean_exit_lsb(), let's move the logic for the special handling of the two LSB exit codes into the sysv-generator by writing out appropriate SuccessExitStatus= lines if the LSB header exists. This is not only semantically more correct, but also fixes a bug as the code in service.c that chose between is_clean_exit_lsb() and is_clean_exit() based this check on whether a native unit file was available for the unit. However, that check has been bogus for a long time, ever since the SysV generator was introduced and native SysV script support was removed from PID 1, as in that case a unit file always exists.	2016-10-10 21:48:08 +02:00
0xAX	c76cf844d6	tree-wide: pass return value of make_null_stdio() to warning instead of errno (#4328) as @poettering suggested in #4320	2016-10-10 19:51:33 +02:00
0xAX	10c961b9c9	main: initialize default unit a little later (#4321) systemd fills arg_default_unit during startup with the default.target value. But arg_default_unit may be overwritten in parse_argv() or parse_proc_cmdline_item(). Let's check the value of arg_default_unit after the calls of parse_argv() and parse_proc_cmdline_item() and fill it with default.target if it wasn't filled before. In this way we will not spend unnecessary time filling arg_default_unit with default.target.	2016-10-09 22:57:03 -04:00
0xAX	9fc932bff1	tree-wide: print warning in a failure case of make_null_stdio() (#4320) make_null_stdio() may fail. Let's check its result and print a warning message instead of keeping silent.	2016-10-09 22:55:24 -04:00
Lennart Poettering	4b58153dd2	core: add "invocation ID" concept to service manager This adds a new invocation ID concept to the service manager. The invocation ID identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is generated each time a unit moves from an inactive to an activating or active state. The primary use case for this concept is to connect the runtime data PID 1 maintains about a service with the offline data the journal stores about it. Previously we'd use the unit name plus start/stop times, which however is highly racy since the journal will generally process log data after the service already ended. The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel, except that it applies to an individual unit instead of the whole system. The invocation ID is passed to the activated processes as environment variable. It is additionally stored as extended attribute on the cgroup of the unit. The latter is used by journald to automatically retrieve it for each logged message and attach it to the log entry. The environment variable is very easily accessible, even for unprivileged services. OTOH the extended attribute is only accessible to privileged processes (this is because cgroupfs only supports the "trusted." xattr namespace, not "user."). The environment variable may be altered by services, the extended attribute may not be, hence is the better choice for the journal. Note that reading the invocation ID off the extended attribute from journald is racy, similar to the way reading the unit name for a logging process is. This patch adds APIs to read the invocation ID to sd-id128: sd_id128_get_invocation() may be used in a similar fashion to sd_id128_get_boot(). PID1's own logging is updated to always include the invocation ID when it logs information about a unit. A new bus call GetUnitByInvocationID() is added that allows retrieving a bus path to a unit by its invocation ID. The bus path is built using the invocation ID, thus providing a path for referring to a unit that is valid only for the current runtime cycle of it. Outlook for the future: should the kernel eventually allow passing of cgroup information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we can alter the invocation ID to be generated as hash from that rather than entirely randomly. This way we can derive the invocation ID race-freely from the messages.	2016-10-07 20:14:38 +02:00
Zbigniew Jędrzejewski-Szmek	8f4d640135	core: only warn on short reads on signal fd	2016-10-07 10:05:04 -04:00
Lennart Poettering	875ca88da5	manager: tighten incoming notification message checks Let's not accept datagrams with embedded NUL bytes. Previously we'd simply ignore everything after the first NUL byte. But given that sending us that is pretty ugly let's instead complain and refuse. With this change we'll only accept messages that have exactly zero or one NUL bytes at the very end of the datagram.	2016-10-07 12:14:33 +02:00
Lennart Poettering	045a3d5989	manager: be stricter with incoming notifications, warn properly about too large ones Let's make the kernel let us know the full, original datagram size of the incoming message. If it's larger than the buffer space provided by us, drop the whole message with a warning. Before this change the kernel would truncate the message for us to the buffer space provided, and we'd not complain about this, and simply process the incomplete message as far as it made sense.	2016-10-07 12:12:10 +02:00
Lennart Poettering	c55ae51e77	manager: don't ever busy loop when we get a notification message we can't process If the kernel doesn't permit us to dequeue/process an incoming notification datagram message it's still better to stop processing the notification messages altogether than to enter a busy loop where we keep getting notified but can't do a thing about it. With this change, manager_dispatch_notify_fd() behaviour is changed like this: - if an error indicating a spurious wake-up is seen on recvmsg(), ignore it (EAGAIN/EINTR) - if any other error is seen on recvmsg() propagate it, thus disabling processing of further wakeups - if any error is seen on later code in the function, warn about it but do not propagate it, as in this case we're not going to busy loop as the offending message is already dequeued.	2016-10-07 12:08:51 +02:00
Lukáš Nykrýn	24dd31c19e	core: add possibility to set action for ctrl-alt-del burst (#4105) For some certification, it should not be possible to reboot the machine through ctrl-alt-delete. Currently we suggest that our customers mask the ctrl-alt-delete target, but that is obviously not enough. Patching the keymaps to disable that is really not a way to go for them, because the settings need to be easily checked by some SCAP tools.	2016-10-06 21:08:21 -04:00
Lennart Poettering	97f0e76f18	user-util: rework maybe_setgroups() a bit Let's drop the caching of the setgroups /proc field for now. While there's a strict regime in place when it changes states, let's better not cache it since we cannot really be sure we follow that regime correctly. More importantly however, this is not in performance sensitive code, and there's no indication the cache is really beneficial, hence let's drop the caching and make things a bit simpler. Also, while we are at it, rework the error handling a bit, and always return negative errno-style error codes, following our usual coding style. This has the benefit that we can sensibly handle read_one_line_file() errors, without having to update errno explicitly.	2016-10-06 19:04:10 +02:00
Lennart Poettering	2d6fce8d7c	core: leave PAM stub process around with GIDs updated In the process execution code of PID 1, before `096424d123` the GID settings were changed before invoking PAM, and the UID settings after. After the change both changes are made after the PAM session hooks are run. When invoking PAM we fork once, and leave a stub process around which will invoke the PAM session end hooks when the session goes away. This code previously was dropping the remaining privs (which were precisely the UID). Fix this code to do this correctly again, by really dropping them all (i.e. the GID as well). While we are at it, also fix error logging of this code. Fixes: #4238	2016-10-06 19:04:10 +02:00
Giuseppe Scrivano	36d854780c	core: do not fail in a container if we can't use setgroups It might be blocked through /proc/PID/setgroups	2016-10-06 11:49:00 +02:00
Giuseppe Scrivano	77531863ca	Fix typo	2016-10-05 18:36:48 +02:00
Stefan Schweter	629ff674ac	tree-wide: remove consecutive duplicate words in comments	2016-10-04 17:06:25 +02:00
Michael Olbrich	c080fbce9c	automount: make sure the expire event is restarted after a daemon-reload (#4265) If the corresponding mount unit is deserialized after the automount unit then the expire event is set up in automount_trigger_notify(). However, if the mount unit is deserialized first then the automount unit is still in state AUTOMOUNT_DEAD and automount_trigger_notify() aborts without setting up the expire event. Explicitly call automount_start_expire() during coldplug to make sure that the expire event is set up as necessary. Fixes #4249.	2016-10-04 16:13:27 +02:00
Zbigniew Jędrzejewski-Szmek	a63ee40751	core: do not try to create /run/systemd/transient in test mode This prevented systemd-analyze from unprivileged operation on older systemd installations, which should be possible. Also, we shouldn't touch the file system in test mode even if we can.	2016-10-01 22:53:17 +02:00
Zbigniew Jędrzejewski-Szmek	dd5e7000cb	core: complain if Before= dep on .device is declared [Unit] Before=foobar.device [Service] ExecStart=/bin/true Type=oneshot $ systemd-analyze verify before-device.service before-device.service: Dependency Before=foobar.device ignored (.device units cannot be delayed)	2016-10-01 22:53:17 +02:00
Zbigniew Jędrzejewski-Szmek	5fd2c135f1	core: update warning message "closing all" might suggest that _all_ fds received with the notification message will be closed. Reword the message to clarify that only the "unused" ones will be closed.	2016-10-01 11:01:31 +02:00
Zbigniew Jędrzejewski-Szmek	c4bee3c40e	core: get rid of unneeded state variable No functional change.	2016-10-01 11:01:31 +02:00
Zbigniew Jędrzejewski-Szmek	a86b76753d	pid1: more informative error message for ignored notifications It's probably easier to diagnose a bad notification message if the contents are printed. But still, do anything only if debugging is on.	2016-09-29 22:57:57 +02:00
Zbigniew Jędrzejewski-Szmek	8523bf7dd5	pid1: process zero-length notification messages again This undoes `531ac2b234`. I acked that patch without looking at the code carefully enough. There are two problems: - we want to process the fds anyway - in principle empty notification messages are valid, and we should process them as usual, including logging using log_unit_debug().	2016-09-29 22:57:57 +02:00
Franck Bui	9987750e7a	pid1: don't return any error in manager_dispatch_notify_fd() (#4240) If manager_dispatch_notify_fd() fails and returns an error then the handling of service notifications will be disabled entirely leading to a compromised system. For example pid1 won't be able to receive the WATCHDOG messages anymore and will kill all services supposed to send such messages.	2016-09-29 19:44:34 +02:00
Jorge Niedbalski	531ac2b234	If the notification message length is 0, ignore the message (#4237) Fixes #4234. Signed-off-by: Jorge Niedbalski <jnr@metaklass.org>	2016-09-29 05:26:16 -04:00
Evgeny Vereshchagin	cc238590e4	Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1 core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes	2016-09-28 04:50:30 +03:00
Paweł Szewczyk	00bb64ecfa	core: Fix USB functionfs activation and clarify its documentation (#4188) There was no certainty about what the path in the service file should look like for USB functionfs activation. Because of this it was treated differently in different places, which made this feature unusable. This patch fixes the path to be the mount directory of functionfs, not the ep0 file path, and clarifies in the documentation that ListenUSBFunction should be the location of the functionfs mount point, not the ep0 file itself.	2016-09-26 18:45:47 +02:00
Djalal Harouni	8f81a5f61b	core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set Instead of having a local syscall list, use the @raw-io group which contains the same set of syscalls to filter.	2016-09-25 12:52:27 +02:00
Djalal Harouni	b6c432ca7e	core:namespace: simplify ProtectHome= implementation As with previous patch simplify ProtectHome and don't care about duplicates, they will be sorted by most restrictive mode and cleaned.	2016-09-25 12:41:16 +02:00
Djalal Harouni	f471b2afa1	core: simplify ProtectSystem= implementation ProtectSystem= with all its different modes and other options like PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal, however currently it's a bit hard to parse that from the implementation view. Simplify it by giving each mode its own table with all paths and references to other Protect options. With this change some entries are duplicated, but we do not care since duplicate mounts are first sorted by the most restrictive mode then cleaned.	2016-09-25 12:21:25 +02:00
Djalal Harouni	49accde7bd	core:sandbox: add more /proc/* entries to ProtectKernelTunables= Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface, filesystems configuration and IRQ tuning readonly. Most of these interfaces should nowadays be in /sys but they are still available through /proc, so just protect them. This patch does not touch /proc/net/...	2016-09-25 11:30:11 +02:00
Djalal Harouni	2652c6c103	core:namespace: simplify mount calculation Move mount calculation out into its own function. Actually the logic is smart enough to later drop nop and duplicate mounts; this change improves code readability. --- src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++----------- 1 file changed, 36 insertions(+), 11 deletions(-)	2016-09-25 11:25:00 +02:00
Djalal Harouni	11a30cec2a	core:namespace: put paths protected by ProtectKernelTunables= in Instead of having all these paths everywhere, put the ones that are protected by ProtectKernelTunables= into their own table. This way it is easy to add paths and track which ones are protected.	2016-09-25 11:16:44 +02:00
Djalal Harouni	9c94d52e09	core:namespace: minor improvements to append_mounts()	2016-09-25 11:03:21 +02:00
Lennart Poettering	cefc33aee2	execute: move SMACK setup code into its own function While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the main execution logic a bit.	2016-09-25 10:52:57 +02:00
Lennart Poettering	cd2902c954	namespace: drop all mounts outside of the new root directory There's no point in mounting these, if they are outside of the root directory we'll move to.	2016-09-25 10:52:57 +02:00
Lennart Poettering	54500613a4	main: minor simplification	2016-09-25 10:52:57 +02:00
Lennart Poettering	ba128bb809	execute: filter low-level I/O syscalls if PrivateDevices= is set If device access is restricted via PrivateDevices=, let's also block the various low-level I/O syscalls at the same time, so that we know that the minimal set of devices in our virtualized /dev are really everything the unit can access.	2016-09-25 10:52:57 +02:00
Lennart Poettering	8f1ad200f0	namespace: don't make the root directory of a namespace a mount if it already is one Let's not stack mounts needlessly.	2016-09-25 10:42:18 +02:00
Lennart Poettering	d944dc9553	namespace: chase symlinks for mounts to set up in userspace This adds logic to chase symlinks for all mount points that shall be created in a namespace environment in userspace, instead of leaving this to the kernel. This has the advantage that we can correctly handle absolute symlinks that shall be taken relative to a specific root directory. Moreover, we can properly handle mounts created on symlinked files or directories as we can merge their mounts as necessary. (This also drops the "done" flag in the namespace logic, which was never actually working, but was supposed to permit a partial rollback of the namespace logic, which however is only mildly useful as it wasn't clear in which case it would or would not be able to roll back.) Fixes: #3867	2016-09-25 10:42:18 +02:00
Lennart Poettering	1e4e94c881	namespace: invoke unshare() only after checking all parameters Let's create the new namespace only after we validated and processed all parameters, right before we start with actually mounting things. This way, the window where we can roll back is larger (not that it matters IRL...)	2016-09-25 10:42:18 +02:00
Lennart Poettering	096424d123	execute: drop group privileges only after setting up namespace If PrivateDevices=yes is set, the namespace code creates device nodes in /dev that should be owned by the host's root, hence let's make sure we set up the namespace before dropping group privileges.	2016-09-25 10:42:18 +02:00
Lennart Poettering	63bb64a056	core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1 Let's make sure that services that use DynamicUser=1 cannot leave files in the file system should the system accidentally have a world-writable directory somewhere. This effectively ensures that directories need to be whitelisted rather than blacklisted for access when DynamicUser=1 is set.	2016-09-25 10:42:18 +02:00
Lennart Poettering	3f815163ff	core: introduce ProtectSystem=strict Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a new setting "strict". If set, the entire directory tree of the system is mounted read-only, but the API file systems /proc, /dev, /sys are excluded (they may be managed with PrivateDevices= and ProtectKernelTunables=). Also, /home and /root are excluded as those are left for ProtectHome= to manage. In this mode, all "real" file systems (i.e. non-API file systems) are mounted read-only, and specific directories may only be excluded via ReadWriteDirectories=, thus implementing an effective whitelist instead of blacklist of writable directories. While we are at it, also add /efi to the list of paths always affected by ProtectSystem=. This is a follow-up for `b52a109ad3` which added /efi as alternative for /boot. Our namespacing logic should respect that too.	2016-09-25 10:42:18 +02:00
Lennart Poettering	160cfdbed3	namespace: add some debug logging when enforcing InaccessiblePaths=	2016-09-25 10:42:18 +02:00
Lennart Poettering	6b7c9f8bce	namespace: rework how ReadWritePaths= is applied Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths= specification, then we'd first recursively apply the ReadOnlyPaths= paths, and make everything below read-only, only in order to then flip the read-only bit again for the subdirs listed in ReadWritePaths= below it. This is not only ugly (as for the dirs in question we first turn on the RO bit, only to turn it off again immediately after), but also problematic in containers, where a container manager might have marked a set of dirs read-only and this code will undo this if ReadWritePaths= is set for any. With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be applied to the children listed in ReadWritePaths= in the first place, so that we do not need to turn off the RO bit for those after all. This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the RO bit, but never to turn it off again. Or to say this differently: if some dirs are marked read-only via some external tool, then ReadWritePaths= will not undo it. This is not only the safer option, but also more in-line with what the man page currently claims: "Entries (files or directories) listed in ReadWritePaths= are accessible from within the namespace with the same access rights as from outside." To implement this change bind_remount_recursive() gained a new "blacklist" string list parameter, which when passed may contain subdirs that shall be excluded from the read-only mounting. A number of functions are updated to add more debug logging to make this more digestible.	2016-09-25 10:40:51 +02:00
Lennart Poettering	7648a565d1	namespace: when enforcing fs namespace restrictions suppress redundant mounts If /foo is marked to be read-only, and /foo/bar too, then the latter may be suppressed as it has no effect.	2016-09-25 10:19:15 +02:00
Lennart Poettering	6ee1a919cf	namespace: simplify mount_path_compare() a bit	2016-09-25 10:19:10 +02:00
Lennart Poettering	3fbe8dbe41	execute: if RuntimeDirectory= is set, it should be writable Implicitly make all dirs set with RuntimeDirectory= writable, as the concept otherwise makes no sense.	2016-09-25 10:19:05 +02:00
Lennart Poettering	be39ccf3a0	execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c This adds a new call get_user_creds_clean(), which is just like get_user_creds() but returns NULL in the home/shell parameters if they contain no useful information. This code previously lived in execute.c, but by generalizing this we can reuse it in run.c.	2016-09-25 10:18:57 +02:00
Lennart Poettering	07689d5d2c	execute: split out creation of runtime dirs into its own functions	2016-09-25 10:18:54 +02:00
Lennart Poettering	fe3c2583be	namespace: make sure InaccessibleDirectories= masks all mounts further down If a dir is marked to be inaccessible then everything below it should be masked by it.	2016-09-25 10:18:51 +02:00
Lennart Poettering	59eeb84ba6	core: add two new service settings ProtectKernelTunables= and ProtectControlGroups= If enabled, these will block write access to /sys, /proc/sys and /sys/fs/cgroup.	2016-09-25 10:18:48 +02:00
Lennart Poettering	72246c2a65	core: enforce seccomp for secondary archs too, for all rules Let's make sure that all our rules apply to all archs the local kernel supports.	2016-09-25 10:18:44 +02:00
Zbigniew Jędrzejewski-Szmek	43688c49d1	tree-wide: rename config_parse_many to …_nulstr In preparation for adding a version which takes a strv.	2016-09-16 10:32:03 -04:00
Evgeny Vereshchagin	47af450af0	Merge pull request #4119 from keszybz/drop-more-kdbus Drop more kdbus functionality	2016-09-10 09:26:43 +03:00
Kyle Russell	7dd736abec	service: fixup ExecStop for socket-activated shutdown (#4120) Previous fix didn't consider handling multiple ExecStop commands.	2016-09-10 08:55:36 +03:00
Michael Olbrich	0dd99f86ad	unit: send change signal before removing the unit if necessary (#4106) If the unit is in the dbus queue when it is removed then the last change signal is never sent. Fix this by checking the dbus queue and explicitly sending the change signal before sending the remove signal.	2016-09-09 16:05:06 +01:00
Zbigniew Jędrzejewski-Szmek	232f6754f6	pid1: drop kdbus_fd and all associated logic	2016-09-09 15:16:26 +01:00
Kyle Russell	f2dbd059a6	service: Continue shutdown on socket activated unit on termination (#4108 ) ENOTCONN may be a legitimate return code if the endpoint disappeared, but the service should still attempt to shutdown cleanly.	2016-09-09 05:34:43 +03:00
Felipe Sateler	d347d9029c	seccomp: also detect if seccomp filtering is enabled In https://github.com/systemd/systemd/pull/4004 , a runtime detection method for seccomp was added. However, it does not detect the case where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible if the architecture does not support filtering yet. Add a check for that case too. While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl, as that should save a few system calls and (unnecessary) allocations. Previously, reading of /proc/self/stat was done as recommended by prctl(2) as safer. However, given that we need to do the prctl call anyway, lets skip opening, reading and parsing the file. Code for checking inspired by https://outflux.net/teach-seccomp/autodetect.html	2016-09-06 20:25:49 -03:00
Lennart Poettering	cf08b48642	core: introduce MemorySwapMax= (#3659 ) Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls the "memory.swap.max" attribute in the unified cgroup hierarchy.	2016-08-31 12:28:54 +02:00
Lennart Poettering	126c6aedb8	load-fragment: Resolve specifiers in OnCalendar and On*Sec (#4045 ) Resolves #3534	2016-08-31 12:07:39 +02:00
WaLyong Cho	96e131ea09	core: introduce MemorySwapMax= Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls the "memory.swap.max" attribute in the unified cgroup hierarchy.	2016-08-30 11:11:45 +09:00
Barron Rulon	49915de245	mount: add SloppyOptions= to mount_dump()	2016-08-27 10:47:46 -04:00
Barron Rulon	4f8d40a9dc	mount: add new ForceUnmount= setting for mount units, mapping to umount(8)'s "-f" switch	2016-08-27 10:46:52 -04:00
Douglas Christman	2507992f6b	load-fragment: Resolve specifiers in OnCalendar and On*Sec Resolves #3534	2016-08-26 12:13:16 -04:00
brulon	e520950a03	mount: add new LazyUnmount= setting for mount units, mapping to umount(8)'s "-l" switch (#3827 )	2016-08-26 17:57:22 +02:00
Evgeny Vereshchagin	6afe14ff5b	Merge pull request #3984 from poettering/refcnt permit bus clients to pin units to avoid automatic GC	2016-08-26 16:17:05 +03:00
Felipe Sateler	8dec4a9d2d	core,network: Use const qualifiers for block-local variables in macro functions (#4019 ) Prevents discard-qualifiers warnings when the passed variable was const	2016-08-23 12:29:30 +03:00
Felipe Sateler	83f12b27d1	core: do not fail at step SECCOMP if there is no kernel support (#4004 ) Fixes #3882	2016-08-22 22:40:58 +03:00
Lennart Poettering	390bc2b149	core: let's use set_contains() where appropriate	2016-08-22 16:14:21 +02:00
Lennart Poettering	fe700f46ec	core: cache last CPU usage counter, before destroying a cgroup It is useful for clients to be able to read the last CPU usage counter value of a unit even if the unit is already terminated. Hence, before destroying a cgroup, cache the last CPU usage counter and return it if the cgroup is gone.	2016-08-22 16:14:21 +02:00
Lennart Poettering	05a98afd3e	core: add Ref()/Unref() bus calls for units This adds two (privileged) bus calls Ref() and Unref() to the Unit interface. The two calls may be used by clients to pin a unit into memory, so that various runtime properties aren't flushed out by the automatic GC. This is necessary to permit clients to race-freely acquire runtime results (such as process exit status/code or accumulated CPU time) on successful service termination. Ref() and Unref() are fully recursive, hence act like the usual reference counting concept in C. Taking a reference is a privileged operation, as this allows pinning units into memory which consumes resources. Transient units may also gain a reference at the time of creation, via the new AddRef property (that is only defined for transient units at the time of creation).	2016-08-22 16:14:21 +02:00
Zbigniew Jędrzejewski-Szmek	2056ec1927	Merge pull request #3965 from htejun/systemd-controller-on-unified	2016-08-19 19:58:01 -04:00
Lennart Poettering	16d901e251	Merge pull request #3987 from keszybz/console-color-setup Rework console color setup	2016-08-19 19:36:09 +02:00
Lennart Poettering	cbf138ebef	Merge pull request #3988 from keszybz/journald-dynamic-users Journald dynamic users	2016-08-19 10:41:26 +02:00
Zbigniew Jędrzejewski-Szmek	61755fdae0	journald: do not create split journals for dynamic users Dynamic users should be treated like system users, and their logs should end up in the main system journal.	2016-08-18 23:34:40 -04:00
Zbigniew Jędrzejewski-Szmek	986a34a683	core/dynamic-users: warn when creation of symlinks for dynamic users fails Also return the first error, since it's most likely to be interesting. If unlink fails, symlink will usually return EEXIST.	2016-08-18 23:09:29 -04:00
Tejun Heo	f50582649f	logind: update empty and "infinity" handling for [User]TasksMax (#3835 ) The parsing functions for [User]TasksMax were inconsistent. Empty string and "infinity" were interpreted as no limit for TasksMax but not accepted for UserTasksMax. Update them so that they're consistent with other knobs. * Empty string indicates the default value. * "infinity" indicates no limit. While at it, replace opencoded (uint64_t) -1 with CGROUP_LIMIT_MAX in TasksMax handling. v2: Update empty string to indicate the default value as suggested by Zbigniew Jędrzejewski-Szmek. v3: Fixed empty UserTasksMax handling.	2016-08-18 22:57:53 -04:00
Zbigniew Jędrzejewski-Szmek	bd64d82c1c	Revert "pid1: reconnect to the console before being re-executed" This reverts commit `affd7ed1a9`. > So it looks like make_console_stdio() has a bad side effect. More specifically it > does a TIOCSCTTY ioctl (via acquire_terminal()) which seems to disturb the > process which was using/owning the console. Fixes #3842. https://bugs.debian.org/834367 https://bugzilla.redhat.com/show_bug.cgi?id=1367766	2016-08-18 22:30:15 -04:00
Zbigniew Jędrzejewski-Szmek	206fc4b284	systemd: warn when setrlimit fails This should make it easier to figure things out.	2016-08-18 22:29:56 -04:00
Lennart Poettering	fd63e712b2	core: bypass dynamic user lookups from dbus-daemon dbus-daemon does NSS name look-ups in order to enforce its bus policy. This might dead-lock if an NSS module wants to use D-Bus for the look-up itself, like our nss-systemd does. Let's work around this by bypassing bus communication in the NSS module if we run inside of dbus-daemon. To make this work we keep a bit of extra state in /run/systemd/dynamic-uid/ so that we don't have to consult the bus, but can still resolve the names. Note that the normal codepath continues to be via the bus, so that resolving works from all mount namespaces and is subject to authentication, as before. This is a bit dirty, but not too dirty, as dbus-daemon is kinda special anyway for PID 1.	2016-08-19 00:50:24 +02:00
Lennart Poettering	00d9ef8560	core: add RemoveIPC= setting This adds the boolean RemoveIPC= setting to service, socket, mount and swap units (i.e. all unit types that may invoke processes). If turned on, and the unit's user/group is not root, all IPC objects of the user/group are removed when the service is shut down. The life-cycle of the IPC objects is hence bound to the unit life-cycle. This is particularly relevant for units with dynamic users, as it is essential that no objects owned by the dynamic users survive the service exiting. In fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set. In order to communicate the UID/GID of an executed process back to PID 1 this adds a new "user lookup" socket pair, that is inherited into the forked processes, and closed before the exec(). This is needed since we cannot do NSS from PID 1 due to deadlock risks, but we need to know the used UID/GID in order to clean up IPC owned by it if the unit shuts down.	2016-08-19 00:37:25 +02:00
Lennart Poettering	51d73fd96a	core: move obsolete properties to the end of vtables This makes it easier to discern the relevant and obsolete parts of the vtables, and in particular helps when comparing introspection data with the actual vtable definitions.	2016-08-18 22:49:48 +02:00
Lennart Poettering	92b25bcabb	core: make use of uid_is_valid() when checking for UID validity	2016-08-18 22:49:48 +02:00
Lennart Poettering	b4c990e91b	unit: remove orphaned cgroup_netclass_id field	2016-08-18 22:49:48 +02:00
Tejun Heo	5da38d0768	core: use the unified hierarchy for the systemd cgroup controller hierarchy Currently, systemd uses either the legacy hierarchies or the unified hierarchy. When the legacy hierarchies are used, systemd uses a named legacy hierarchy mounted on /sys/fs/cgroup/systemd without any kernel controllers for process management. Due to the shortcomings in the legacy hierarchy, this involves a lot of workarounds and complexities. Because the unified hierarchy can be mounted and used in parallel to legacy hierarchies, there's no reason for systemd to use a legacy hierarchy for management even if the kernel resource controllers need to be mounted on legacy hierarchies. It can simply mount the unified hierarchy under /sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies. This disables a significant amount of fragile workaround logic and would allow using features which depend on the unified hierarchy membership such as the bpf cgroup v2 membership test. In time, this would also allow deleting the said complexities. This patch updates systemd so that it prefers the unified hierarchy for the systemd cgroup controller hierarchy when legacy hierarchies are used for kernel resource controllers. * cg_unified(@controller) is introduced which tests whether the specific controller is on the unified hierarchy and is used to choose the unified hierarchy code path for process and service management when available. Kernel controller specific operations remain gated by cg_all_unified(). * "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to force the use of legacy hierarchy for systemd cgroup controller. * nspawn: By default nspawn uses the same hierarchies as the host. If UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If 0, legacy for all. * nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of three options - legacy, only systemd controller on unified, and unified. The value is passed into mount setup functions and controls cgroup configuration. * nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount option is moved to mount_legacy_cgroup_hierarchy() so that it can take an appropriate action depending on the configuration of the host. v2: - CGroupUnified enum replaces open coded integer values to indicate the cgroup operation mode. - Various style updates. v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2. v4: Restored legacy container on unified host support and fixed another bug in detect_unified_cgroup_hierarchy().	2016-08-17 17:44:36 -04:00
Tejun Heo	ca2f6384aa	core: rename cg_unified() to cg_all_unified() A following patch will update cgroup handling so that the systemd controller (/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel resource controllers are on the legacy hierarchies. This would require distinguishing whether all controllers are on cgroup v2 or only the systemd controller is. In preparation, this patch renames cg_unified() to cg_all_unified(). This patch doesn't cause any functional changes.	2016-08-15 18:13:36 -04:00
Zbigniew Jędrzejewski-Szmek	5f9a610ad2	Merge pull request #3905 from htejun/cgroup-v2-cpu core: add cgroup CPU controller support on the unified hierarchy (zj: merging not squashing to make it clear against which upstream this patch was developed.)	2016-08-14 18:03:35 -04:00
Zbigniew Jędrzejewski-Szmek	87da8a864f	core: amend policy to open up dynamic user queries (#3920 )	2016-08-08 23:39:16 +02:00
Tejun Heo	66ebf6c0a1	core: add cgroup CPU controller support on the unified hierarchy Unfortunately, due to the disagreements in the kernel development community, CPU controller cgroup v2 support has not been merged and enabling it requires applying two small out-of-tree kernel patches. The situation is explained in the following documentation. https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu While it isn't clear what will happen with CPU controller cgroup v2 support, there are critical features which are possible only on cgroup v2 such as buffered write control making cgroup v2 essential for a lot of workloads. This commit implements systemd CPU controller support on the unified hierarchy so that users who choose to deploy CPU controller cgroup v2 support can easily take advantage of it. On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max" replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config options are added with the usual compat translation. CPU quota settings remain unchanged and apply to both legacy and unified hierarchies. v2: - Error in man page corrected. - CPU config application in cgroup_context_apply() refactored. - CPU accounting now works on unified hierarchy.	2016-08-07 09:45:39 -04:00
Zbigniew Jędrzejewski-Szmek	d87a2ef782	Merge pull request #3884 from poettering/private-users	2016-08-06 17:04:45 -04:00
Zbigniew Jędrzejewski-Szmek	3bb81a80bd	Merge pull request #3818 from poettering/exit-status-env beef up /var/tmp and /tmp handling; set $SERVICE_RESULT/$EXIT_CODE/$EXIT_STATUS on ExecStop= and make sure root/nobody are always resolvable	2016-08-05 20:55:08 -04:00
Lennart Poettering	ceab9e2dee	Merge pull request #3900 from keszybz/fix-3607 Fix 3607	2016-08-05 17:03:09 +02:00
Zbigniew Jędrzejewski-Szmek	80a58668d9	socket: add helper function to remove code duplication	2016-08-05 08:24:00 -04:00
Zbigniew Jędrzejewski-Szmek	ea8f50f808	core/socket: include remote address in the message when dropping connection Without the address the message is not very useful. Aug 04 23:52:21 rawhide systemd[1]: testlimit.socket: Too many incoming connections (4) from source ::1, dropping connection.	2016-08-05 08:16:31 -04:00
Zbigniew Jędrzejewski-Szmek	3ebcd323bd	systemd: do not serialize peer, bump count when deserializing socket instead	2016-08-05 08:16:31 -04:00
Zbigniew Jędrzejewski-Szmek	9dfb64f87d	core/service: serialize and deserialize accept_socket This fixes an issue during reexec: the count of connections would be lost: [zbyszek@fedora-rawhide ~]$ systemctl status testlimit.socket | grep Connected Accepted: 1; Connected: 1 [zbyszek@fedora-rawhide ~]$ sudo systemctl daemon-reexec [zbyszek@fedora-rawhide ~]$ systemctl status testlimit.socket | grep Connected Accepted: 1; Connected: 0 With the patch, Connected count is preserved. Also add "Accept Socket" to the dump output for services.	2016-08-05 08:16:31 -04:00
Zbigniew Jędrzejewski-Szmek	166cf510c2	core/socket: rework SocketPeer refcounting Make functions and definitions that don't need to be shared local to socket.c.	2016-08-05 08:12:31 -04:00
Lennart Poettering	41bf0590cc	util-lib: unify parsing of nice level values This adds parse_nice() that parses a nice level and ensures it is in the right range, via a new nice_is_valid() helper. It then ports over a number of users to this. No functional changes.	2016-08-05 11:18:32 +02:00
Zbigniew Jędrzejewski-Szmek	9a73653c3e	systemd: convert peers_by_address to a set	2016-08-04 23:53:07 -04:00
Lennart Poettering	b08af3b127	core: only set the watchdog variables in ExecStart= lines	2016-08-04 23:08:05 +02:00
Lennart Poettering	a0fef983ab	core: remember first unit failure, not last unit failure Previously, the result value of a unit was overridden with each failure that took place, so that the result always reported the last failure that took place. With this commit this is changed, so that the first failure taking place is stored instead. This should normally not matter much as multiple failures are sufficiently uncommon. However, it improves one behaviour: if we send SIGABRT to a service due to a watchdog timeout, then this currently would be reported as "coredump" failure, rather than the "watchdog" failure it really is. Hence, in order to report information about the type of the failure, and not about the effect of it, let's change this for all unit types to store the first, not the last failure. This addresses the issue pointed out here: https://github.com/systemd/systemd/pull/3818#discussion_r73433520	2016-08-04 23:08:05 +02:00
Lennart Poettering	136dc4c435	core: set $SERVICE_RESULT, $EXIT_CODE and $EXIT_STATUS in ExecStop=/ExecStopPost= commands This should simplify monitoring tools for services, by passing the most basic information about service result/exit information via environment variables, thus making it unnecessary to retrieve them explicitly via the bus.	2016-08-04 23:08:05 +02:00
0xAX	13811bf909	main: use pager for --dump-configuration-items (#3894 )	2016-08-04 22:52:24 +02:00
Lennart Poettering	af9d16e10a	core: use the correct APIs to determine whether a dual timestamp is initialized	2016-08-04 16:27:07 +02:00
Lennart Poettering	9c1a61adba	core: move masking of chroot/permission masking into service_spawn() Let's fix up the flags fields in service_spawn() rather than its callers, in order to simplify things a bit.	2016-08-04 16:27:07 +02:00
Lennart Poettering	c39f1ce24d	core: turn various execution flags into a proper flags parameter The ExecParameters structure contains a number of bit-flags, that were so far exposed as bool:1, change this to a proper, single binary bit flag field. This makes things a bit more expressive, and is helpful as we add more flags, since these booleans are passed around in various callers, for example service_spawn(), whose signature can be made much shorter now. Not all bit booleans from ExecParameters are moved into the flags field for now, but this can be added later.	2016-08-04 16:27:07 +02:00
Lennart Poettering	eb18df724b	Merge pull request #2471 from michaelolbrich/transient-mounts allow transient mounts and automounts	2016-08-04 16:16:04 +02:00
David Michael	5124866d73	util-lib: add parse_percent_unbounded() for percentages over 100% (#3886 ) This permits CPUQuota to accept greater values as documented.	2016-08-04 13:09:54 +02:00
Lennart Poettering	d251207d55	core: add new PrivateUsers= option to service execution This setting adds minimal user namespacing support to a service. When set the invoked processes will run in their own user namespace. Only a trivial mapping will be set up: the root user/group is mapped to root, and the user/group of the service will be mapped to itself, everything else is mapped to nobody. If this setting is used the service runs with no capabilities on the host, but configurable capabilities within the service. This setting is particularly useful in conjunction with RootDirectory= as the need to synchronize /etc/passwd and /etc/group between the host and the service OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the user of the service itself. But even outside the RootDirectory= case this setting is useful to substantially reduce the attack surface of a service. Example command to test this: systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh This runs a shell as user "foobar". When typing "ps" only processes owned by "root", by "foobar", and by "nobody" should be visible.	2016-08-03 20:42:04 +02:00
Lennart Poettering	7049382803	execute: don't set $SHELL and $HOME for services, if they don't contain interesting data	2016-08-03 14:52:16 +02:00
Lennart Poettering	6af760f3b2	core: inherit TERM from PID 1 for all services started on /dev/console This way, invoking nspawn from a shell in the best case inherits the TERM setting all the way down into the login shell spawned in the container. Fixes: #3697	2016-08-03 14:52:16 +02:00
Lennart Poettering	43992e57e0	core: drop spurious newline	2016-08-03 14:52:16 +02:00
Susant Sahani	9d56542764	socket: add support to control no. of connections from one source (#3607 ) Introduce MaxConnectionsPerSource=, which sets the number of concurrent connections allowed per IP. RFE: 1939	2016-08-02 13:48:23 -04:00
Ismo Puustinen	96694e998b	main: load Smack policy before IMA policy (#3859 ) IMA wiki says: "If the IMA policy contains LSM labels, then the LSM policy must be loaded prior to the IMA policy." Right now, in case of Smack, the IMA policy is loaded before the Smack policy. Move the order around to allow Smack labels to be used in IMA policy.	2016-08-02 08:58:30 -04:00
0xAX	494294d6f8	main: get rid of ACTION_DONE (#3849 ) ACTION_DONE was introduced in commit `4288f61921` (dbus: automatically generate and install introspection files) and was used by the systemd --introspect command. Later the 'introspect' command was removed in commit `ca2871d9b` (bus: remove static introspection file export), so ACTION_DONE has no users anymore and we can remove it.	2016-08-01 12:38:25 +02:00
Zbigniew Jędrzejewski-Szmek	dadd6ecfa5	Merge pull request #3728 from poettering/dynamic-users	2016-07-25 16:40:26 -04:00
Michael Olbrich	87d41d6244	automount: don't cancel mount/umount request on reload/reexec (#3670 ) All pending tokens are already serialized correctly and will be handled when the mount unit is done. Without this a 'daemon-reload' cancels all pending tokens. Any process waiting for the mount will continue with EHOSTDOWN. This can happen when the mount unit waits for its dependencies, e.g. network, devices, fsck, etc.	2016-07-25 20:04:02 +02:00
Michael Olbrich	2de0b9e913	transaction: don't cancel jobs for units with IgnoreOnIsolate=true (#3671 ) This is important if a job was queued for a unit but not yet started. Without this, the job will be canceled and is never executed even though IgnoreOnIsolate is set to 'true'.	2016-07-25 20:02:55 +02:00
Lennart Poettering	43eb109aa9	core: change ExecStart=! syntax to ExecStart=+ (#3797 ) As suggested by @mbiebl we already use the "!" special char in unit file assignments for negation, hence we should not use it in a different context for privileged execution. Let's use "+" instead.	2016-07-25 16:53:33 +02:00
Zbigniew Jędrzejewski-Szmek	31b14fdb6f	Merge pull request #3777 from poettering/id128-rework uuid/id128 code rework	2016-07-22 21:18:41 -04:00
Lennart Poettering	5052c4eadd	Merge pull request #3753 from poettering/tasks-max-scale Add support for relative TasksMax= specifications, and bump default for services	2016-07-22 17:40:12 +02:00
Alessandro Puccetti	0d9e799102	cgroup: whitelist inaccessible devices for "auto" and "closed" DevicePolicy. https://github.com/systemd/systemd/pull/3685 introduced /run/systemd/inaccessible/{chr,blk} to map inaccessible devices; this patch allows systemd running inside an nspawn container to create /run/systemd/inaccessible/{chr,blk}.	2016-07-22 16:08:31 +02:00
Lennart Poettering	409093fe10	nss: add new "nss-systemd" NSS module for mapping dynamic users With this NSS module all dynamic service users will be resolvable via NSS like any real user.	2016-07-22 15:53:45 +02:00
Lennart Poettering	6f3e79859d	core: enforce user/group name validity also when creating transient units	2016-07-22 15:53:45 +02:00
Lennart Poettering	29206d4619	core: add a concept of "dynamic" user ids, that are allocated as long as a service is running This adds a new boolean setting DynamicUser= to service files. If set, a new user will be allocated dynamically when the unit is started, and released when it is stopped. The user ID is allocated from the range 61184..65519. The user will not be added to /etc/passwd (but an NSS module to be added later should make it show up in getent passwd). For now, care should be taken that the service writes no files to disk, since this might result in files owned by UIDs that might get assigned dynamically to a different service later on. Later patches will tighten sandboxing in order to ensure that this cannot happen, except for a few selected directories. A simple way to test this is: systemd-run -p DynamicUser=1 /bin/sleep 99999	2016-07-22 15:53:45 +02:00
Lennart Poettering	66dccd8d85	core: be stricter when parsing User=/Group= fields Let's verify the validity of the syntax of the user/group names set.	2016-07-22 15:53:45 +02:00
Lennart Poettering	b3785cd5e6	core: check for overflow when handling scaled MemoryLimit= settings Just in case...	2016-07-22 15:33:13 +02:00
Harald Hoyer	2424b6bd71	macros.systemd.in: add %systemd_ordering (#3776 ) To remove the hard dependency on systemd, for packages, which function without a running systemd the %systemd_ordering macro can be used to ensure ordering in the rpm transaction. %systemd_ordering makes sure, the systemd rpm is installed prior to the package, so the %pre/%post scripts can execute the systemd parts. Installing systemd afterwards though, does not result in the same outcome.	2016-07-22 09:33:13 -04:00
Lennart Poettering	79baeeb96d	core: change TasksMax= default for system services to 15% As it turns out, the old limit of 512 as the maximum number of tasks per service is hit by too many applications, hence let's bump it a bit, and make it relative to the system's maximum number of PIDs. With this change the new default is 15%. At the kernel's default pids_max value of 32768 this translates to 4915. At machined's default TasksMax= setting of 16384 this translates to 2457. Why 15%? Because it sounds like a round number and is close enough to 4096 which I was going for, i.e. an eight-fold increase over the old 512. Summary: old default: 512 on the host, 512 in a container; new default: 4915 on the host, 2457 in a container.	2016-07-22 15:33:13 +02:00
Lennart Poettering	84af7821b6	main: simplify things a bit by moving container check into fixup_environment()	2016-07-22 15:33:12 +02:00
Lennart Poettering	f7903e8db6	core: rename MemoryLimitByPhysicalMemory transient property to MemoryLimitScale That way, we can neatly keep this in line with the new TasksMaxScale= option. Note that we didn't release a version with MemoryLimitByPhysicalMemory= yet, hence this change should be unproblematic without breaking API.	2016-07-22 15:33:12 +02:00
Lennart Poettering	83f8e80857	core: support percentage specifications on TasksMax= This adds support for a TasksMax=40% syntax for specifying values relative to the system's configured maximum number of processes. This is useful in order to neatly subdivide the available room for tasks within containers.	2016-07-22 15:33:12 +02:00
Lennart Poettering	4b1afed01f	core: rework machine-id-setup.c to use the calls from id128-util.[ch] This allows us to delete quite a bit of code and make the whole thing a lot shorter.	2016-07-22 12:59:36 +02:00
Lennart Poettering	e042eab720	main: make sure set_machine_id() doesn't clobber arg_machine_id on failure	2016-07-22 12:59:36 +02:00
Lennart Poettering	15b1248a6b	machine-id-setup: port machine_id_commit() to new id128-util.c APIs	2016-07-22 12:59:36 +02:00
Lennart Poettering	910fd145f4	sd-id128: split UUID file read/write code into new id128-util.[ch] We currently have code to read and write files containing UUIDs at various places. Unify this in id128-util.[ch], and move some other stuff there too. The new files are located in src/libsystemd/sd-id128/ (instead of src/shared/), because they are actually the backend of sd_id128_get_machine() and sd_id128_get_boot(). In follow-up patches we can use this to reduce the code in nspawn and machine-id-setup by adopting the common implementation.	2016-07-22 12:59:36 +02:00
Martin Pitt	bf3dd08a81	Merge pull request #3762 from poettering/sigkill-log log about all processes we forcibly kill	2016-07-22 09:18:30 +02:00
Martin Pitt	5c3c778014	Merge pull request #3764 from poettering/assorted-stuff-2 Assorted fixes	2016-07-22 09:10:04 +02:00
Alessandro Puccetti	31d28eabc1	nspawn: enable major=0/minor=0 devices inside the container (#3773 ) https://github.com/systemd/systemd/pull/3685 introduced /run/systemd/inaccessible/{chr,blk} to map inaccessible devices; this patch allows systemd running inside an nspawn container to create /run/systemd/inaccessible/{chr,blk}.	2016-07-21 17:39:38 +02:00
Thomas H. P. Andersen	f8298f7be3	core: remove duplicate includes (#3771 )	2016-07-21 10:52:07 +02:00
Topi Miettinen	176e51b710	namespace: fix wrong return value from mount(2) (#3758 ) Fix bug introduced by #3263: mount(2) return value is 0 or -1, not errno. Thanks to Evgeny Vereshchagin (@evverx) for reporting.	2016-07-20 17:43:21 +03:00
Lennart Poettering	33df919d5c	execute: make sure JoinsNamespaceOf= doesn't leak ns fds to executed processes	2016-07-20 14:53:15 +02:00
Lennart Poettering	fe048ce56a	namespace: add a (void) cast	2016-07-20 14:53:15 +02:00
Lennart Poettering	9ce9347880	core: normalize header inclusion in execute.h a bit We don't actually need any functionality from cgroup.h in execute.h, hence don't include that. However, we do need the Unit structure from unit.h, hence include that, and move it as late as possible, since it needs the definitions from execute.h.	2016-07-20 14:53:15 +02:00
Lennart Poettering	7a1ab780c4	execute: normalize connect_logger_as() parameters slightly All other functions in execute.c that need the unit id take a Unit* parameter as first argument. Let's change connect_logger_as() to follow a similar logic.	2016-07-20 14:53:15 +02:00
Lennart Poettering	3862e809d0	core: when a scope was abandoned, always log about processes we kill After all, if a unit is abandoned, all processes inside of it may be considered "left over" and are something we should better log about.	2016-07-20 14:35:15 +02:00
Lennart Poettering	f4b0fb236b	core: make sure RequestStop signal is sent directed This was accidentally left commented out for debugging purposes, let's fix that and make the signal directed again.	2016-07-20 14:35:15 +02:00
Lennart Poettering	1d98fef17d	core: when forcibly killing/aborting left-over unit processes log about it Let's log at LOG_NOTICE about any processes that we are going to SIGKILL/SIGABRT because clean termination of them didn't work. This turns the various boolean flag parameters to cg_kill(), cg_migrate() and related calls into a single binary flags parameter, simply because the function now gained even more parameters and the parameter list shouldn't get too long. Logging for killing processes is done either when the kill signal is SIGABRT or SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of KILL_TERMINATE is passed. This isn't used yet in this patch, but is made use of in a later patch.	2016-07-20 14:35:15 +02:00
Lennart Poettering	5fd7cf6fe2	namespace: minor improvements We generally try to avoid strerror(), due to its threads-unsafety, let's do this here, too. Also, let's be tiny bit more explanatory with the log messages, and let's shorten a few things.	2016-07-20 08:57:25 +02:00
Lennart Poettering	d724118e20	core: hide legacy bus properties We usually hide legacy bus properties from introspection. Let's do that for the InaccessibleDirectories= properties too. The properties stay accessible if requested, but they won't be listed anymore if people introspect the unit.	2016-07-20 08:55:50 +02:00
Alessandro Puccetti	2a624c36e6	doc,core: Read{Write,Only}Paths= and InaccessiblePaths= This patch renames Read{Write,Only}Directories= and InaccessibleDirectories= to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept as aliases but they are not advertised in the documentation. Renamed variables: `read_write_dirs` --> `read_write_paths` `read_only_dirs` --> `read_only_paths` `inaccessible_dirs` --> `inaccessible_paths`	2016-07-19 17:22:02 +02:00
Alessandro Puccetti	c4b4170746	namespace: unify limit behavior on non-directory paths Despite the name, `Read{Write,Only}Directories=` already allows for regular file paths to be masked. This commit adds the same behavior to `InaccessibleDirectories=` and makes it explicit in the doc. This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}` {file,device} nodes and mounts the paths specified in `InaccessibleDirectories=` on the appropriate one. Based on Luca's patch from https://github.com/systemd/systemd/pull/3327	2016-07-19 17:22:02 +02:00
Lukáš Nykrýn	ccc2c98e1b	manager: don't skip sigchld handler for main and control pid for services (#3738 ) During stop, when a service has one "regular" pid, one main pid and one control pid, and the sigchld for the regular one is processed first, unit_tidy_watch_pids will skip the main and control pid and not remove them from u->pids(). But then we skip the sigchld event because we already did one in the iteration and there are two pids in u->pids. v2: Use general unit_main_pid() and unit_control_pid() instead of reaching directly into the service structure.	2016-07-16 15:04:13 -04:00
Zbigniew Jędrzejewski-Szmek	2ed968802c	tree-wide: get rid of selinux_context_t (#3732 ) `9eb9c93275` deprecated selinux_context_t. Replace with a simple char* everywhere. Alternative fix for #3719.	2016-07-15 18:44:02 +02:00
Zbigniew Jędrzejewski-Szmek	1071fd0823	macros: provide %_systemdgeneratordir and %_systemdusergeneratordir (#3672 ) ... as requested in https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/DJ7HDNRM5JGBSA4HL3UWW5ZGLQDJ6Y7M/. Adding the macro makes it marginally easier to create generators for outside projects. I opted for "generatordir" and "usergeneratordir" to match %unitdir and %userunitdir. OTOH, "_systemd" prefix makes it obvious that this is related to systemd. "%_generatordir" would be too generic a name.	2016-07-15 09:35:49 +02:00
Lennart Poettering	2e79d1828a	shutdown: already sync IO before we enter the final killing spree This way, slow IO that journald has to wait for can't cause it to reach the killing spree timeout and be hit by SIGKILL in addition to SIGTERM.	2016-07-12 17:38:19 +02:00
Lennart Poettering	d450612953	shutdown: use 90s SIGKILL timeout There's really no reason to use 10s here, let's instead default to 90s like we do for everything else. The SIGKILL during the final killing spree is in most regards the fourth level of a safety net, after all: any normal service should have already been stopped during the normal service shutdown logic, first via SIGTERM and then SIGKILL, and then also via SIGTERM during the final killing spree before we send SIGKILL. And as a fourth level safety net it should only be required in exceptional cases, which means it's safe to raise the default timeout, as normal shutdowns should never be delayed by it. Note that journald excludes itself from the normal service shutdown, and relies on the final killing spree to terminate it (this is because it wants to cover the normal shutdown phase's complete logging). If the system's IO is excessively slow, then the 10s might not be enough for journald to sync everything to disk and logs might get lost during shutdown.	2016-07-12 17:32:30 +02:00
Michael Biebl	595bfe7df2	Various fixes for typos found by lintian (#3705 )	2016-07-12 12:52:11 +02:00
Luca Bruno	391b81cd03	seccomp: only abort on syscall name resolution failures (#3701 ) seccomp_syscall_resolve_name() can return a mix of positive and negative (pseudo-) syscall numbers, while errors are signaled via __NR_SCMP_ERROR. This commit lets the syscall filter parser only abort on real parsing failures, letting libseccomp handle pseudo-syscall number on its own and allowing proper multiplexed syscalls filtering.	2016-07-12 11:55:26 +02:00
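The distinction drawn in the seccomp commit above can be sketched as follows. This is a minimal Python model under stated assumptions, not systemd's or libseccomp's code: `resolve_name`, `parse_filter` and the lookup table are hypothetical, the numbers in the table are placeholders, and the error sentinel is assumed to be -1.

```python
NR_SCMP_ERROR = -1  # assumed value of the __NR_SCMP_ERROR sentinel

# Hypothetical lookup table; the numbers are placeholders. A negative entry
# stands in for a pseudo-syscall number of a multiplexed syscall.
FAKE_SYSCALL_TABLE = {"read": 0, "socket": -101}


def resolve_name(name):
    """Stand-in for a name resolver: returns a number or the error sentinel."""
    return FAKE_SYSCALL_TABLE.get(name, NR_SCMP_ERROR)


def parse_filter(names):
    """Abort only on real resolution failures, not on every negative number."""
    numbers = []
    for name in names:
        nr = resolve_name(name)
        if nr == NR_SCMP_ERROR:
            raise ValueError(f"failed to resolve syscall name: {name}")
        numbers.append(nr)  # negative pseudo-syscall numbers pass through
    return numbers
```

Comparing against the sentinel, rather than testing `nr < 0`, is what lets negative pseudo-syscall numbers reach the filter.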
Torstein Husebø	61233823aa	treewide: fix typos and remove accidental repetition of words	2016-07-11 16:18:43 +02:00
Evgeny Vereshchagin	224d3d8266	Merge pull request #3680 from joukewitteveen/pam-env Follow up on #3503 (pass service env vars to PAM sessions)	2016-07-08 17:33:12 +03:00
Jouke Witteveen	84eada2f7f	execute: Do not alter call-by-ref parameter on failure Prevent free from being called on (a part of) the call-by-reference variable env when setup_pam fails.	2016-07-08 09:42:48 +02:00
David Michael	4f952a3f07	core: queue loading transient units after setting their properties (#3676 ) The unit load queue can be processed in the middle of setting the unit's properties, so its load_state would no longer be UNIT_STUB for the check in bus_unit_set_properties(), which would cause it to incorrectly return an error.	2016-07-08 05:43:01 +02:00
Daniel Mack	78a4ee591a	cgroup: fix memory cgroup limit regression on kernel 3.10 (#3673 ) Commit `da4d897e` ("core: add cgroup memory controller support on the unified hierarchy (#3315)") changed the code in src/core/cgroup.c to always write the real numeric value from the cgroup parameters to the "memory.limit_in_bytes" attribute file. For parameters set to CGROUP_LIMIT_MAX, this results in the string "18446744073709551615" being written into that file, which is UINT64_MAX. Before that commit, CGROUP_LIMIT_MAX was special-cased to the string "-1". This causes a regression on CentOS 7, which is based on kernel 3.10, as the value is interpreted as signed 64 bit, and clamped to 0: [root@n54 ~]# echo 18446744073709551615 >/sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes [root@n54 ~]# cat /sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes 0 [root@n54 ~]# echo -1 >/sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes [root@n54 ~]# cat /sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes 9223372036854775807 Hence, all units that are subject to the limits enforced by the memory controller will crash immediately, even though they have no actual limit set. This happens for the user.slice too, for instance: [ 453.577153] Hardware name: SeaMicro SM15000-64-CC-AA-1Ox1/AMD Server CRB, BIOS Estoc.3.72.19.0018 08/19/2014 [ 453.587024] ffff880810c56780 00000000aae9501f ffff880813d7fcd0 ffffffff816360fc [ 453.594544] ffff880813d7fd60 ffffffff8163109c ffff88080ffc5000 ffff880813d7fd28 [ 453.602120] ffffffff00000202 fffeefff00000000 0000000000000001 ffff880810c56c03 [ 453.609680] Call Trace: [ 453.612156] [<ffffffff816360fc>] dump_stack+0x19/0x1b [ 453.617324] [<ffffffff8163109c>] dump_header+0x8e/0x214 [ 453.622671] [<ffffffff8116d20e>] oom_kill_process+0x24e/0x3b0 [ 453.628559] [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30 [ 453.634969] [<ffffffff811d4155>] mem_cgroup_oom_synchronize+0x575/0x5a0 [ 453.641721] [<ffffffff811d3520>] ? mem_cgroup_charge_common+0xc0/0xc0 [ 453.648299] [<ffffffff8116da84>] pagefault_out_of_memory+0x14/0x90 [ 453.654621] [<ffffffff8162f4cc>] mm_fault_error+0x68/0x12b [ 453.660233] [<ffffffff81642012>] __do_page_fault+0x3e2/0x450 [ 453.666017] [<ffffffff816420a3>] do_page_fault+0x23/0x80 [ 453.671467] [<ffffffff8163e308>] page_fault+0x28/0x30 [ 453.676656] Task in /user.slice/user-0.slice/user@0.service killed as a result of limit of /user.slice/user-0.slice/user@0.service [ 453.688477] memory: usage 0kB, limit 0kB, failcnt 7 [ 453.693391] memory+swap: usage 0kB, limit 9007199254740991kB, failcnt 0 [ 453.700039] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 [ 453.706076] Memory cgroup stats for /user.slice/user-0.slice/user@0.service: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB [ 453.725702] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 453.733614] [ 2837] 0 2837 11950 899 23 0 0 (systemd) [ 453.741919] Memory cgroup out of memory: Kill process 2837 ((systemd)) score 1 or sacrifice child [ 453.750831] Killed process 2837 ((systemd)) total-vm:47800kB, anon-rss:3188kB, file-rss:408kB Fix this issue by special-casing the UINT64_MAX case again.	2016-07-07 19:29:35 -07:00
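The special case restored by the cgroup commit above can be sketched as follows. This is a minimal Python model, not the code in src/core/cgroup.c: the helper name `format_memory_limit` is hypothetical, and only the "-1" convention and the UINT64_MAX value come from the commit message.

```python
CGROUP_LIMIT_MAX = 2**64 - 1  # UINT64_MAX, i.e. "no limit"


def format_memory_limit(limit):
    """Hypothetical helper: the string to write to memory.limit_in_bytes."""
    if limit == CGROUP_LIMIT_MAX:
        # Write "-1" instead of "18446744073709551615", which the commit
        # message reports kernel 3.10 clamps to 0.
        return "-1"
    return str(limit)
```

For example, `format_memory_limit(CGROUP_LIMIT_MAX)` yields the string `-1`, while any concrete limit is written as its plain decimal value.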
Jouke Witteveen	1280503b7e	execute: Cleanup the environment early By cleaning up before setting up PAM we maintain control of overriding behavior in setting variables. Otherwise, pam_putenv is in control. This also makes sure we use a cleaned up environment in replacing variables in argv.	2016-07-07 14:15:50 +02:00
Kyle Walker	1e706c8dff	manager: Fixing a debug printf formatting mistake (#3640 ) A 'llu' formatting statement was used in a debugging printf statement instead of a 'PRIu64'. Correcting that mistake here.	2016-07-01 20:03:35 +03:00
Lennart Poettering	b12cc5b0f8	Merge pull request #3634 from disneyworldguy/v2sigchld manager: Only invoke a single sigchld per unit within a cleanup cycle	2016-06-30 15:57:39 -07:00
Martin Pitt	f15461b2b2	Merge pull request #3596 from poettering/machine-clean make "machinectl clean" asynchronous, and open it up via PolicyKit	2016-06-30 21:30:35 +02:00
Kyle Walker	36f20ae3b2	manager: Only invoke a single sigchld per unit within a cleanup cycle By default, each iteration of manager_dispatch_sigchld() results in a unit level sigchld event being invoked. For scope units, this results in a scope_sigchld_event() which can seemingly stall for workloads that have a large number of PIDs within the scope. The stall exhibits itself as a SIG_0 being initiated for each u->pids entry as a result of pid_is_unwaited(). v2: This patch resolves this condition by only paying the cost of a sigchld in the underlying scope unit once per sigchld iteration. A new "sigchldgen" member resides within the Unit struct. The Manager is incremented via the sd event loop, accessed via sd_event_get_iteration, and the Unit member is set to the same value as the manager each time that a sigchld event is invoked. If the Manager iteration value and Unit member match, the sigchld event is not invoked for that iteration.	2016-06-30 15:16:47 -04:00
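The once-per-iteration guard described in the commit above can be sketched as follows. This is a minimal Python model, not systemd's C implementation: the `Unit` class, `dispatch_sigchld` and the `events` counter are hypothetical, and only the `sigchldgen` member name comes from the commit message.

```python
class Unit:
    """Hypothetical stand-in for a unit with a per-unit generation marker."""

    def __init__(self, name):
        self.name = name
        self.sigchldgen = None  # iteration in which sigchld was last invoked
        self.events = 0         # how often the (expensive) handler has run

    def sigchld_event(self):
        self.events += 1


def dispatch_sigchld(unit, iteration):
    """Invoke the unit's sigchld event at most once per loop iteration."""
    if unit.sigchldgen == iteration:
        return False  # already handled in this iteration, skip the cost
    unit.sigchldgen = iteration
    unit.sigchld_event()
    return True
```

Calling `dispatch_sigchld` repeatedly with the same iteration value runs the handler once; a new iteration value lets it run again.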
Franck Bui	6edefe0b06	pid1: restore console color support for containers (#3595 ) Commit `3a18b60489` introduced a regression that disabled the color mode for containers. This patch fixes this.	2016-06-24 16:08:43 +02:00
Lennart Poettering	2b40998d3c	cgroup: minor coding style fix	2016-06-24 15:59:24 +02:00
Lennart Poettering	f4170c671b	execute: add a new easy-to-use RestrictRealtime= option to units It takes a boolean value. If true, access to SCHED_RR, SCHED_FIFO and SCHED_DEADLINE, which may be used to lock up the system, is blocked.	2016-06-23 01:45:45 +02:00
Lennart Poettering	abd84d4d83	execute: be a little less drastic when MemoryDenyWriteExecute= hits Let's politely refuse with EPERM rather than kill the whole thing right-away.	2016-06-23 01:35:04 +02:00
Lennart Poettering	686d9ba614	execute: set PR_SET_NO_NEW_PRIVS also in case the exec memory protection is used This was forgotten when MemoryDenyWriteExecute= was added: we should set NNP in all cases when we set seccomp filters.	2016-06-23 01:33:07 +02:00

2905 Commits