Mirror of https://github.com/systemd/systemd.git (synced 2025-02-28 05:57:33 +03:00)
Merge pull request #6763 from kinvolk/iaguis/no-new-privs
core: allow using seccomp without no_new_privs when unprivileged
commit 00666ec71f
@@ -823,21 +823,10 @@ CapabilityBoundingSet=~CAP_B CAP_C</programlisting>
 <listitem><para>Takes a boolean argument. If true, ensures that the service process and all its
 children can never gain new privileges through <function>execve()</function> (e.g. via setuid or
 setgid bits, or filesystem capabilities). This is the simplest and most effective way to ensure that
-a process and its children can never elevate privileges again. Defaults to false, but certain
-settings override this and ignore the value of this setting. This is the case when
-<varname>DynamicUser=</varname>, <varname>LockPersonality=</varname>,
-<varname>MemoryDenyWriteExecute=</varname>, <varname>PrivateDevices=</varname>,
-<varname>ProtectClock=</varname>, <varname>ProtectHostname=</varname>,
-<varname>ProtectKernelLogs=</varname>, <varname>ProtectKernelModules=</varname>,
-<varname>ProtectKernelTunables=</varname>, <varname>RestrictAddressFamilies=</varname>,
-<varname>RestrictNamespaces=</varname>, <varname>RestrictRealtime=</varname>,
-<varname>RestrictSUIDSGID=</varname>, <varname>SystemCallArchitectures=</varname>,
-<varname>SystemCallFilter=</varname>, or <varname>SystemCallLog=</varname> are specified. Note that
-even if this setting is overridden by them, <command>systemctl show</command> shows the original
-value of this setting. In case the service will be run in a new mount namespace anyway and SELinux is
-disabled, all file systems are mounted with <constant>MS_NOSUID</constant> flag. Also see
-the kernel document
-<ulink url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>.
+a process and its children can never elevate privileges again. Defaults to false. In case the service
+will be run in a new mount namespace anyway and SELinux is disabled, all file systems are mounted with
+<constant>MS_NOSUID</constant> flag. Also see <ulink
+url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>.
 </para>

 <para>Note that this setting only has an effect on the unit's processes themselves (or any processes
@@ -1779,9 +1768,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
 <citerefentry><refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum></citerefentry> of
 <filename>/dev/zero</filename> instead of using <constant>MAP_ANON</constant>. For this setting the
 same restrictions regarding mount propagation and privileges apply as for
-<varname>ReadOnlyPaths=</varname> and related calls, see above. If turned on and if running in user
-mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+<varname>ReadOnlyPaths=</varname> and related calls, see above.</para>

 <para>Note that the implementation of this setting might be impossible (for example if mount
 namespaces are not available), and the unit should be written in a way that does not solely rely on
@@ -1973,10 +1960,6 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
 the system into the service, it is hence not suitable for services that need to take notice of system
 hostname changes dynamically.</para>

-<para>If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant>
-capability (e.g. services for which <varname>User=</varname> is set),
-<varname>NoNewPrivileges=yes</varname> is implied.</para>
-
 <xi:include href="system-or-user-ns.xml" xpointer="singular"/>

 <xi:include href="version-info.xml" xpointer="v242"/></listitem>
@@ -1994,9 +1977,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
 Effectively, <filename>/dev/rtc0</filename>, <filename>/dev/rtc1</filename>, etc. are made read-only
 to the service. See
 <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>
-for the details about <varname>DeviceAllow=</varname>. If this setting is on, but the unit doesn't
-have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for which
-<varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+for the details about <varname>DeviceAllow=</varname>.</para>

 <para>It is recommended to turn this on for most services that do not need modify the clock or check
 its state.</para>
@@ -2018,13 +1999,10 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
 <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry> mechanism. Few
 services need to write to these at runtime; it is hence recommended to turn this on for most services. For this
 setting the same restrictions regarding mount propagation and privileges apply as for
-<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off. If this
-setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability
-(e.g. services for which <varname>User=</varname> is set),
-<varname>NoNewPrivileges=yes</varname> is implied. Note that this option does not prevent
-indirect changes to kernel tunables effected by IPC calls to other processes. However,
-<varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system objects
-inaccessible. If <varname>ProtectKernelTunables=</varname> is set,
+<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.
+Note that this option does not prevent indirect changes to kernel tunables effected by IPC calls to
+other processes. However, <varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system
+objects inaccessible. If <varname>ProtectKernelTunables=</varname> is set,
 <varname>MountAPIVFS=yes</varname> is implied.</para>

 <xi:include href="system-or-user-ns.xml" xpointer="singular"/>
@@ -2046,9 +2024,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
 both privileged and unprivileged. To disable module auto-load feature please see
 <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry>
 <constant>kernel.modules_disabled</constant> mechanism and
-<filename>/proc/sys/kernel/modules_disabled</filename> documentation. If this setting is on,
-but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for
-which <varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+<filename>/proc/sys/kernel/modules_disabled</filename> documentation.</para>

 <xi:include href="system-or-user-ns.xml" xpointer="singular"/>

@@ -2067,9 +2043,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
 <citerefentry project='man-pages'><refentrytitle>syslog</refentrytitle><manvolnum>3</manvolnum></citerefentry>
 for userspace logging). The kernel exposes its log buffer to userspace via <filename>/dev/kmsg</filename> and
 <filename>/proc/kmsg</filename>. If enabled, these are made inaccessible to all the processes in the unit.
-If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant>
-capability (e.g. services for which <varname>User=</varname> is set),
-<varname>NoNewPrivileges=yes</varname> is implied.</para>
+</para>

 <xi:include href="system-or-user-ns.xml" xpointer="singular"/>

@@ -2113,12 +2087,9 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
 including x86-64). Note that on systems supporting multiple ABIs (such as x86/x86-64) it is
 recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the
 restrictions of this option. Specifically, it is recommended to combine this option with
-<varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system
-mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. By default, no
-restrictions apply, all address families are accessible to processes. If assigned the empty string,
-any previous address family restriction changes are undone. This setting does not affect commands
-prefixed with <literal>+</literal>.</para>
+<varname>SystemCallArchitectures=native</varname> or similar. By default, no restrictions apply, all
+address families are accessible to processes. If assigned the empty string, any previous address family
+restriction changes are undone. This setting does not affect commands prefixed with <literal>+</literal>.</para>

 <para>Use this option to limit exposure of processes to remote access, in particular via exotic and sensitive
 network protocols, such as <constant>AF_PACKET</constant>. Note that in most cases, the local
@@ -2251,9 +2222,7 @@ RestrictFileSystems=ext4</programlisting>
 creation and switching of the specified types of namespaces (or all of them, if true) access to the
 <function>setns()</function> system call with a zero flags parameter is prohibited. This setting is only
 supported on x86, x86-64, mips, mips-le, mips64, mips64-le, mips64-n32, mips64-le-n32, ppc64, ppc64-le, s390
-and s390x, and enforces no restrictions on other architectures. If running in user mode, or in system mode, but
-without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-<varname>NoNewPrivileges=yes</varname> is implied.</para>
+and s390x, and enforces no restrictions on other architectures.</para>

 <para>Example: if a unit has the following,
 <programlisting>RestrictNamespaces=cgroup ipc
@@ -2274,9 +2243,7 @@ RestrictNamespaces=~cgroup net</programlisting>
 project='man-pages'><refentrytitle>personality</refentrytitle><manvolnum>2</manvolnum></citerefentry> system
 call so that the kernel execution domain may not be changed from the default or the personality selected with
 <varname>Personality=</varname> directive. This may be useful to improve security, because odd personality
-emulations may be poorly tested and source of vulnerabilities. If running in user mode, or in system mode, but
-without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-<varname>NoNewPrivileges=yes</varname> is implied.</para>
+emulations may be poorly tested and source of vulnerabilities.</para>

 <xi:include href="version-info.xml" xpointer="v235"/></listitem>
 </varlistentry>
@@ -2308,9 +2275,7 @@ RestrictNamespaces=~cgroup net</programlisting>
 available on x86. Note that on systems supporting multiple ABIs (such as x86/x86-64) it is
 recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the
 restrictions of this option. Specifically, it is recommended to combine this option with
-<varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system
-mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+<varname>SystemCallArchitectures=native</varname> or similar.</para>

 <xi:include href="version-info.xml" xpointer="v231"/></listitem>
 </varlistentry>
@@ -2322,9 +2287,7 @@ RestrictNamespaces=~cgroup net</programlisting>
 the unit are refused. This restricts access to realtime task scheduling policies such as
 <constant>SCHED_FIFO</constant>, <constant>SCHED_RR</constant> or <constant>SCHED_DEADLINE</constant>. See
 <citerefentry project='man-pages'><refentrytitle>sched</refentrytitle><manvolnum>7</manvolnum></citerefentry>
-for details about these scheduling policies. If running in user mode, or in system mode, but without the
-<constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-<varname>NoNewPrivileges=yes</varname> is implied. Realtime scheduling policies may be used to monopolize CPU
+for details about these scheduling policies. Realtime scheduling policies may be used to monopolize CPU
 time for longer periods of time, and may hence be used to lock up or otherwise trigger Denial-of-Service
 situations on the system. It is hence recommended to restrict access to realtime scheduling to the few programs
 that actually require them. Defaults to off.</para>
@@ -2338,10 +2301,8 @@ RestrictNamespaces=~cgroup net</programlisting>
 <listitem><para>Takes a boolean argument. If set, any attempts to set the set-user-ID (SUID) or
 set-group-ID (SGID) bits on files or directories will be denied (for details on these bits see
 <citerefentry
-project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>). If
-running in user mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant>
-capability (e.g. setting <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is
-implied. As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
+project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>).
+As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
 identity of other users, it is recommended to restrict creation of SUID/SGID files to the few
 programs that actually require them. Note that this restricts marking of any type of file system
 object with these bits, including both regular files and directories (where the SGID is a different
@@ -2457,15 +2418,12 @@ RestrictNamespaces=~cgroup net</programlisting>
 full list). This value will be returned when a deny-listed system call is triggered, instead of
 terminating the processes immediately. Special setting <literal>kill</literal> can be used to
 explicitly specify killing. This value takes precedence over the one given in
-<varname>SystemCallErrorNumber=</varname>, see below. If running in user mode, or in system mode,
-but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. This feature
-makes use of the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering') and is useful
-for enforcing a minimal sandboxing environment. Note that the <function>execve()</function>,
-<function>exit()</function>, <function>exit_group()</function>, <function>getrlimit()</function>,
-<function>rt_sigreturn()</function>, <function>sigreturn()</function> system calls and the system calls
-for querying time and sleeping are implicitly allow-listed and do not need to be listed
-explicitly. This option may be specified more than once, in which case the filter masks are
+<varname>SystemCallErrorNumber=</varname>, see below. This feature makes use of the Secure Computing Mode 2
+interfaces of the kernel ('seccomp filtering') and is useful for enforcing a minimal sandboxing environment.
+Note that the <function>execve()</function>, <function>exit()</function>, <function>exit_group()</function>,
+<function>getrlimit()</function>, <function>rt_sigreturn()</function>, <function>sigreturn()</function>
+system calls and the system calls for querying time and sleeping are implicitly allow-listed and do not
+need to be listed explicitly. This option may be specified more than once, in which case the filter masks are
 merged. If the empty string is assigned, the filter is reset, all prior assignments will have no
 effect. This does not affect commands prefixed with <literal>+</literal>.</para>

@@ -2692,10 +2650,7 @@ SystemCallErrorNumber=EPERM</programlisting>
 as well as <constant>x32</constant>, <constant>mips64-n32</constant>, <constant>mips64-le-n32</constant>, and
 the special identifier <constant>native</constant>. The special identifier <constant>native</constant>
 implicitly maps to the native architecture of the system (or more precisely: to the architecture the system
-manager is compiled for). If running in user mode, or in system mode, but without the
-<constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-<varname>NoNewPrivileges=yes</varname> is implied. By default, this option is set to the empty list, i.e. no
-filtering is applied.</para>
+manager is compiled for). By default, this option is set to the empty list, i.e. no filtering is applied.</para>

 <para>If this setting is used, processes of this unit will only be permitted to call native system calls, and
 system calls of the specified architectures. For the purposes of this option, the x32 architecture is treated
@@ -2723,13 +2678,11 @@ SystemCallErrorNumber=EPERM</programlisting>
 <listitem><para>Takes a space-separated list of system call names. If this setting is used, all
 system calls executed by the unit processes for the listed ones will be logged. If the first
 character of the list is <literal>~</literal>, the effect is inverted: all system calls except the
-listed system calls will be logged. If running in user mode, or in system mode, but without the
-<constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-<varname>NoNewPrivileges=yes</varname> is implied. This feature makes use of the Secure Computing
-Mode 2 interfaces of the kernel ('seccomp filtering') and is useful for auditing or setting up a
-minimal sandboxing environment. This option may be specified more than once, in which case the filter
-masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will
-have no effect. This does not affect commands prefixed with <literal>+</literal>.</para>
+listed system calls will be logged. This feature makes use of the Secure Computing Mode 2 interfaces
+of the kernel ('seccomp filtering') and is useful for auditing or setting up a minimal sandboxing
+environment. This option may be specified more than once, in which case the filter masks are merged.
+If the empty string is assigned, the filter is reset, all prior assignments will have no effect.
+This does not affect commands prefixed with <literal>+</literal>.</para>

 <xi:include href="version-info.xml" xpointer="v247"/></listitem>
 </varlistentry>
@@ -367,16 +367,16 @@ int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities) {
         return 0;
 }

-int drop_capability(cap_value_t cv) {
+static int change_capability(cap_value_t cv, cap_flag_value_t flag) {
         _cleanup_cap_free_ cap_t tmp_cap = NULL;

         tmp_cap = cap_get_proc();
         if (!tmp_cap)
                 return -errno;

-        if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, CAP_CLEAR) < 0) ||
-            (cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, CAP_CLEAR) < 0) ||
-            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0))
+        if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, flag) < 0) ||
+            (cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, flag) < 0) ||
+            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, flag) < 0))
                 return -errno;

         if (cap_set_proc(tmp_cap) < 0)
@@ -385,6 +385,14 @@ int drop_capability(cap_value_t cv) {
         return 0;
 }

+int drop_capability(cap_value_t cv) {
+        return change_capability(cv, CAP_CLEAR);
+}
+
+int keep_capability(cap_value_t cv) {
+        return change_capability(cv, CAP_SET);
+}
+
 bool ambient_capabilities_supported(void) {
         static int cache = -1;

|
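The hunks above fold "clear" and "set" into one helper that takes the flag value, with `drop_capability()` and `keep_capability()` as thin wrappers. A minimal sketch of that pattern without libcap follows. It models the three capability sets as 64-bit masks; `struct cap_model` and the `model_*` functions are hypothetical names for illustration, not part of systemd or libcap, and the model ignores the kernel's rules about which transitions a process is actually allowed to make.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inheritable, permitted and effective sets of a process. */
struct cap_model {
        uint64_t inheritable;
        uint64_t permitted;
        uint64_t effective;
};

/* One helper does both jobs: 'set' selects between raising and clearing the bit in all three sets. */
static void model_change_capability(struct cap_model *m, unsigned cap, bool set) {
        uint64_t bit;

        assert(m);
        assert(cap < 64);

        bit = UINT64_C(1) << cap;

        if (set) {
                m->inheritable |= bit;
                m->permitted |= bit;
                m->effective |= bit;
        } else {
                m->inheritable &= ~bit;
                m->permitted &= ~bit;
                m->effective &= ~bit;
        }
}

/* Thin wrappers, mirroring drop_capability() and keep_capability() in the diff. */
static void model_drop_capability(struct cap_model *m, unsigned cap) {
        model_change_capability(m, cap, false);
}

static void model_keep_capability(struct cap_model *m, unsigned cap) {
        model_change_capability(m, cap, true);
}
```

Keeping the flag as a parameter means the error handling and the three-set update live in one place, so the two public entry points cannot drift apart.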
@@ -31,6 +31,7 @@ int capability_update_inherited_set(cap_t caps, uint64_t ambient_set);
 int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities);

 int drop_capability(cap_value_t cv);
+int keep_capability(cap_value_t cv);

 DEFINE_TRIVIAL_CLEANUP_FUNC_FULL(cap_t, cap_free, NULL);
 #define _cleanup_cap_free_ _cleanup_(cap_freep)
@@ -1378,15 +1378,7 @@ static bool context_has_syscall_logs(const ExecContext *c) {
                 !hashmap_isempty(c->syscall_log);
 }

-static bool context_has_no_new_privileges(const ExecContext *c) {
-        assert(c);
-
-        if (c->no_new_privileges)
-                return true;
-
-        if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
-                return false;
-
+static bool context_has_seccomp(const ExecContext *c) {
         /* We need NNP if we have any form of seccomp and are unprivileged */
         return c->lock_personality ||
                 c->memory_deny_write_execute ||
@@ -1405,8 +1397,49 @@ static bool context_has_no_new_privileges(const ExecContext *c) {
                 context_has_syscall_logs(c);
 }

+static bool context_has_no_new_privileges(const ExecContext *c) {
+        assert(c);
+
+        if (c->no_new_privileges)
+                return true;
+
+        if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
+                return false;
+
+        return context_has_seccomp(c);
+}
+
 #if HAVE_SECCOMP

+static bool seccomp_allows_drop_privileges(const ExecContext *c) {
+        void *id, *val;
+        bool has_capget = false, has_capset = false, has_prctl = false;
+
+        assert(c);
+
+        /* No syscall filter, we are allowed to drop privileges */
+        if (hashmap_isempty(c->syscall_filter))
+                return true;
+
+        HASHMAP_FOREACH_KEY(val, id, c->syscall_filter) {
+                _cleanup_free_ char *name = NULL;
+
+                name = seccomp_syscall_resolve_num_arch(SCMP_ARCH_NATIVE, PTR_TO_INT(id) - 1);
+
+                if (streq(name, "capget"))
+                        has_capget = true;
+                else if (streq(name, "capset"))
+                        has_capset = true;
+                else if (streq(name, "prctl"))
+                        has_prctl = true;
+        }
+
+        if (c->syscall_allow_list)
+                return has_capget && has_capset && has_prctl;
+        else
+                return !(has_capget || has_capset || has_prctl);
+}
+
 static bool skip_seccomp_unavailable(const ExecContext *c, const ExecParameters *p, const char* msg) {

         if (is_seccomp_available())
@@ -3911,6 +3944,7 @@ int exec_invoke(
                 needs_setuid, /* Do we need to do the actual setresuid()/setresgid() calls? */
                 needs_mount_namespace, /* Do we need to set up a mount namespace for this kernel? */
                 needs_ambient_hack; /* Do we need to apply the ambient capabilities hack? */
+        bool keep_seccomp_privileges = false;
 #if HAVE_SELINUX
         _cleanup_free_ char *mac_selinux_context_net = NULL;
         bool use_selinux = false;
@@ -3920,6 +3954,9 @@ int exec_invoke(
 #endif
 #if HAVE_APPARMOR
         bool use_apparmor = false;
+#endif
+#if HAVE_SECCOMP
+        uint64_t saved_bset = 0;
 #endif
         uid_t saved_uid = getuid();
         gid_t saved_gid = getgid();
@@ -4817,6 +4854,28 @@ int exec_invoke(
                                 (UINT64_C(1) << CAP_SETUID) |
                                 (UINT64_C(1) << CAP_SETGID);

+#if HAVE_SECCOMP
+                /* If the service has any form of a seccomp filter and it allows dropping privileges, we'll
+                 * keep the needed privileges to apply it even if we're not root. */
+                if (needs_setuid &&
+                    uid_is_valid(uid) &&
+                    context_has_seccomp(context) &&
+                    seccomp_allows_drop_privileges(context)) {
+                        keep_seccomp_privileges = true;
+
+                        if (prctl(PR_SET_KEEPCAPS, 1) < 0) {
+                                *exit_status = EXIT_USER;
+                                return log_exec_error_errno(context, params, errno, "Failed to enable keep capabilities flag: %m");
+                        }
+
+                        /* Save the current bounding set so we can restore it after applying the seccomp
+                         * filter */
+                        saved_bset = bset;
+                        bset |= (UINT64_C(1) << CAP_SYS_ADMIN) |
+                                (UINT64_C(1) << CAP_SETPCAP);
+                }
+#endif
+
                 if (!cap_test_all(bset)) {
                         r = capability_bounding_set_drop(bset, /* right_now= */ false);
                         if (r < 0) {
@@ -4858,6 +4917,26 @@ int exec_invoke(
                                 return log_exec_error_errno(context, params, r, "Failed to change UID to " UID_FMT ": %m", uid);
                         }

+                        if (keep_seccomp_privileges) {
+                                r = drop_capability(CAP_SETUID);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETUID: %m");
+                                }
+
+                                r = keep_capability(CAP_SYS_ADMIN);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to keep CAP_SYS_ADMIN: %m");
+                                }
+
+                                r = keep_capability(CAP_SETPCAP);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to keep CAP_SETPCAP: %m");
+                                }
+                        }
+
                         if (!needs_ambient_hack && capability_ambient_set != 0) {

                                 /* Raise the ambient capabilities after user change. */
@@ -5027,14 +5106,6 @@ int exec_invoke(
                         *exit_status = EXIT_SECCOMP;
                         return log_exec_error_errno(context, params, r, "Failed to apply system call log filters: %m");
                 }
-
-                /* This really should remain the last step before the execve(), to make sure our own code is unaffected
-                 * by the filter as little as possible. */
-                r = apply_syscall_filter(context, params, needs_ambient_hack);
-                if (r < 0) {
-                        *exit_status = EXIT_SECCOMP;
-                        return log_exec_error_errno(context, params, r, "Failed to apply system call filters: %m");
-                }
 #endif

 #if HAVE_LIBBPF
@@ -5045,6 +5116,53 @@ int exec_invoke(
                 }
 #endif

+#if HAVE_SECCOMP
+                /* This really should remain as close to the execve() as possible, to make sure our own code is unaffected
+                 * by the filter as little as possible. */
+                r = apply_syscall_filter(context, params, needs_ambient_hack);
+                if (r < 0) {
+                        *exit_status = EXIT_SECCOMP;
+                        return log_exec_error_errno(context, params, r, "Failed to apply system call filters: %m");
+                }
+
+                if (keep_seccomp_privileges) {
+                        /* Restore the capability bounding set with what's expected from the service + the
+                         * ambient capabilities hack */
+                        if (!cap_test_all(saved_bset)) {
+                                r = capability_bounding_set_drop(saved_bset, /* right_now= */ false);
+                                if (r < 0) {
+                                        *exit_status = EXIT_CAPABILITIES;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop bset capabilities: %m");
+                                }
+                        }
+
+                        /* Only drop CAP_SYS_ADMIN if it's not in the bounding set, otherwise we'll break
+                         * applications that use it. */
+                        if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SYS_ADMIN))) {
+                                r = drop_capability(CAP_SYS_ADMIN);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SYS_ADMIN: %m");
+                                }
+                        }
+
+                        /* Only drop CAP_SETPCAP if it's not in the bounding set, otherwise we'll break
+                         * applications that use it. */
+                        if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SETPCAP))) {
+                                r = drop_capability(CAP_SETPCAP);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETPCAP: %m");
+                                }
+                        }
+
+                        if (prctl(PR_SET_KEEPCAPS, 0) < 0) {
+                                *exit_status = EXIT_USER;
+                                return log_exec_error_errno(context, params, errno, "Failed to drop keep capabilities flag: %m");
+                        }
+                }
+#endif
+
         }

         if (!strv_isempty(context->unset_environment)) {
|
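The decision made by `seccomp_allows_drop_privileges()` in the hunks above can be read as a small truth table: with an allow-list, `capget`, `capset` and `prctl` must all be listed; with a deny-list, none of them may be listed; with no filter at all, nothing stands in the way. The sketch below restates that logic over plain strings. `struct filter_model` and the `model_*` functions are hypothetical, and the sketch assumes the filter has already been expanded to individual syscall names, whereas the real code walks a hashmap of syscall numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a unit's syscall filter. */
struct filter_model {
        const char *const *names; /* syscall names listed in the filter */
        size_t n_names;
        bool is_allow_list;       /* true: only listed calls are allowed; false: listed calls are denied */
};

/* Returns true if 'name' appears in the filter list. */
static bool model_filter_lists(const struct filter_model *f, const char *name) {
        for (size_t i = 0; i < f->n_names; i++)
                if (strcmp(f->names[i], name) == 0)
                        return true;

        return false;
}

/* Returns true if capget(), capset() and prctl() all remain callable once the filter is installed. */
static bool model_allows_drop_privileges(const struct filter_model *f) {
        bool has_capget, has_capset, has_prctl;

        assert(f);

        /* No syscall filter configured: nothing can block the later capability changes. */
        if (f->n_names == 0)
                return true;

        has_capget = model_filter_lists(f, "capget");
        has_capset = model_filter_lists(f, "capset");
        has_prctl = model_filter_lists(f, "prctl");

        if (f->is_allow_list)
                return has_capget && has_capset && has_prctl;

        return !(has_capget || has_capset || has_prctl);
}
```

The three calls matter because, after the filter is in place, the code still has to restore the bounding set, drop the temporarily kept capabilities and reset `PR_SET_KEEPCAPS`.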
@@ -754,6 +754,18 @@ static void test_exec_systemcallfilter(Manager *m) {
         test(m, "exec-systemcallfilter-with-errno-in-allow-list.service", errno_from_name("EILSEQ"), CLD_EXITED);
         test(m, "exec-systemcallfilter-override-error-action.service", SIGSYS, CLD_KILLED);
         test(m, "exec-systemcallfilter-override-error-action2.service", errno_from_name("EILSEQ"), CLD_EXITED);
+
+        test(m, "exec-systemcallfilter-nonewprivileges.service", MANAGER_IS_SYSTEM(m) ? 0 : EXIT_GROUP, CLD_EXITED);
+        test(m, "exec-systemcallfilter-nonewprivileges-protectclock.service", MANAGER_IS_SYSTEM(m) ? 0 : EXIT_GROUP, CLD_EXITED);
+
+        r = find_executable("capsh", NULL);
+        if (r < 0) {
+                log_notice_errno(r, "Skipping %s, could not find capsh binary: %m", __func__);
+                return;
+        }
+
+        test(m, "exec-systemcallfilter-nonewprivileges-bounding1.service", MANAGER_IS_SYSTEM(m) ? 0 : EXIT_GROUP, CLD_EXITED);
+        test(m, "exec-systemcallfilter-nonewprivileges-bounding2.service", MANAGER_IS_SYSTEM(m) ? 0 : EXIT_GROUP, CLD_EXITED);
 #endif
 }

@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: LGPL-2.1-or-later
+[Unit]
+Description=Test bounding set is right with SystemCallFilter and non-root user
+
+[Service]
+ExecStart=/bin/sh -x -c 'c=$$(capsh --print | grep "Bounding set "); test "$$c" = "Bounding set =cap_net_bind_service"'
+Type=oneshot
+User=1
+SystemCallFilter=@system-service
+CapabilityBoundingSet=CAP_NET_BIND_SERVICE
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: LGPL-2.1-or-later
+[Unit]
+Description=Test bounding set is right with SystemCallFilter and non-root user
+
+[Service]
+ExecStart=/bin/sh -x -c 'c=$$(capsh --print | grep "Bounding set "); test "$$c" = "Bounding set =cap_setpcap,cap_net_bind_service,cap_sys_admin"'
+Type=oneshot
+User=1
+SystemCallFilter=@system-service
+CapabilityBoundingSet=CAP_SYS_ADMIN CAP_SETPCAP CAP_NET_BIND_SERVICE
|
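The two bounding-set test units above exercise the save, widen and restore steps from the `exec_invoke()` hunks: the configured set is saved, `CAP_SYS_ADMIN` and `CAP_SETPCAP` are added while the seccomp filter is installed, and afterwards each of the two is dropped again only if the unit did not ask for it. The bit arithmetic can be sketched as follows. The `MODEL_*` constants and `model_*` functions are hypothetical names; the numeric values repeat the capability numbers from `<linux/capability.h>` so the sketch stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>, repeated here for self-containment. */
enum {
        MODEL_CAP_SETPCAP = 8,
        MODEL_CAP_NET_BIND_SERVICE = 10,
        MODEL_CAP_SYS_ADMIN = 21,
};

#define MODEL_CAP_BIT(c) (UINT64_C(1) << (c))

/* Bounding set in effect while the seccomp filter is being installed:
 * the configured set plus the two capabilities that are needed temporarily. */
static uint64_t model_temporary_bset(uint64_t configured) {
        return configured |
                MODEL_CAP_BIT(MODEL_CAP_SYS_ADMIN) |
                MODEL_CAP_BIT(MODEL_CAP_SETPCAP);
}

/* After the filter is applied: a temporarily kept capability is dropped again
 * only if the unit's own (saved) bounding set does not contain it. */
static bool model_must_drop_after_filter(uint64_t saved_bset, unsigned cap) {
        assert(cap < 64);

        return (saved_bset & MODEL_CAP_BIT(cap)) == 0;
}
```

For a unit configured like the first test (only `CAP_NET_BIND_SERVICE`), both temporary capabilities are dropped again; for one configured like the second test, which lists all three, neither is dropped and the temporary set equals the configured one.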
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: LGPL-2.1-or-later
+[Unit]
+Description=Test no_new_privs is unset for ProtectClock and non-root user
+
+[Service]
+ExecStart=/bin/sh -x -c 'c=$$(cat /proc/self/status | grep "NoNewPrivs: "); test "$$c" = "NoNewPrivs: 0"'
+Type=oneshot
+User=1
+ProtectClock=yes
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: LGPL-2.1-or-later
+[Unit]
+Description=Test no_new_privs is unset for SystemCallFilter and non-root user
+
+[Service]
+ExecStart=/bin/sh -x -c 'c=$$(cat /proc/self/status | grep "NoNewPrivs: "); test "$$c" = "NoNewPrivs: 0"'
+Type=oneshot
+User=1
+SystemCallFilter=@system-service
|
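The last two test units check the `NoNewPrivs:` field of `/proc/self/status` from a shell one-liner. The same check can be sketched in C as a parser over a status-style text buffer. `model_parse_no_new_privs()` is a hypothetical helper, not a systemd function; it assumes the usual `Key:<whitespace>value` layout, one field per line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the numeric value of the "NoNewPrivs:" field in a /proc/<pid>/status style buffer,
 * or -1 if the field is not present. */
static int model_parse_no_new_privs(const char *status) {
        static const char key[] = "NoNewPrivs:";
        const char *line = status;

        assert(status);

        while (line && *line) {
                /* Only match the key at the start of a line. */
                if (strncmp(line, key, sizeof(key) - 1) == 0)
                        return (int) strtol(line + sizeof(key) - 1, NULL, 10); /* strtol() skips the separating whitespace */

                /* Advance to the character after the next newline, if there is one. */
                line = strchr(line, '\n');
                if (line)
                        line++;
        }

        return -1;
}
```

A caller would read `/proc/self/status` into a NUL-terminated buffer and pass it in; a result of 0 corresponds to what the test units above expect for a non-root service with a seccomp-based setting.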