2019-05-19 13:08:55 +01:00
// SPDX-License-Identifier: GPL-2.0-only
2005-04-16 15:20:36 -07:00
/*
* linux / fs / exec . c
*
* Copyright ( C ) 1991 , 1992 Linus Torvalds
*/
/*
* # ! - checking implemented by tytso .
*/
/*
* Demand - loading implemented 01.12 .91 - no need to read anything but
* the header into memory . The inode of the executable is put into
* " current->executable " , and page faults do the actual loading . Clean .
*
* Once more I can proudly say that linux stood up to being changed : it
* was less than 2 hours work to get demand - loading completely implemented .
*
* Demand loading changed July 1993 by Eric Youngdale . Use mmap instead ,
* current - > executable is only used by the procfs . This allows a dispatch
* table to check for several different types of binary formats . We keep
* trying until we recognize the file or we run out of supported binary
2016-12-21 16:26:24 +11:00
* formats .
2005-04-16 15:20:36 -07:00
*/
2020-10-02 10:38:15 -07:00
# include <linux/kernel_read_file.h>
2005-04-16 15:20:36 -07:00
# include <linux/slab.h>
# include <linux/file.h>
2008-04-24 07:44:08 -04:00
# include <linux/fdtable.h>
2008-07-25 01:45:43 -07:00
# include <linux/mm.h>
2005-04-16 15:20:36 -07:00
# include <linux/stat.h>
# include <linux/fcntl.h>
2008-07-25 01:45:43 -07:00
# include <linux/swap.h>
2007-10-16 23:26:35 -07:00
# include <linux/string.h>
2005-04-16 15:20:36 -07:00
# include <linux/init.h>
2017-02-08 18:51:29 +01:00
# include <linux/sched/mm.h>
2017-02-08 18:51:30 +01:00
# include <linux/sched/coredump.h>
2017-02-08 18:51:30 +01:00
# include <linux/sched/signal.h>
2017-02-08 18:51:31 +01:00
# include <linux/sched/numa_balancing.h>
2017-02-08 18:51:36 +01:00
# include <linux/sched/task.h>
2008-07-28 15:46:18 -07:00
# include <linux/pagemap.h>
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 12:02:48 +02:00
# include <linux/perf_event.h>
2005-04-16 15:20:36 -07:00
# include <linux/highmem.h>
# include <linux/spinlock.h>
# include <linux/key.h>
# include <linux/personality.h>
# include <linux/binfmts.h>
# include <linux/utsname.h>
2006-12-08 02:38:01 -08:00
# include <linux/pid_namespace.h>
2005-04-16 15:20:36 -07:00
# include <linux/module.h>
# include <linux/namei.h>
# include <linux/mount.h>
# include <linux/security.h>
# include <linux/syscalls.h>
2006-09-30 23:28:59 -07:00
# include <linux/tsacct_kern.h>
2005-11-07 00:59:16 -08:00
# include <linux/cn_proc.h>
2006-04-26 14:04:08 -04:00
# include <linux/audit.h>
2008-07-09 10:28:40 +02:00
# include <linux/kmod.h>
2008-12-17 13:53:20 -05:00
# include <linux/fsnotify.h>
2009-03-29 19:50:06 -04:00
# include <linux/fs_struct.h>
2010-10-26 14:21:23 -07:00
# include <linux/oom.h>
2011-03-06 18:02:54 +01:00
# include <linux/compat.h>
2015-12-28 16:02:29 -05:00
# include <linux/vmalloc.h>
2020-09-13 13:09:39 -06:00
# include <linux/io_uring.h>
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector. This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-11-27 14:32:34 -05:00
# include <linux/syscall_user_dispatch.h>
2022-01-21 22:13:17 -08:00
# include <linux/coredump.h>
2005-04-16 15:20:36 -07:00
2016-12-24 11:46:01 -08:00
# include <linux/uaccess.h>
2005-04-16 15:20:36 -07:00
# include <asm/mmu_context.h>
2007-07-19 01:48:16 -07:00
# include <asm/tlb.h>
2012-01-10 15:08:09 -08:00
# include <trace/events/task.h>
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
# include "internal.h"
2005-04-16 15:20:36 -07:00
2012-02-07 10:11:05 -06:00
# include <trace/events/sched.h>
2020-05-29 22:00:54 -05:00
static int bprm_creds_from_file ( struct linux_binprm * bprm ) ;
2005-06-23 00:09:43 -07:00
int suid_dumpable = 0 ;
2007-10-16 23:26:03 -07:00
static LIST_HEAD ( formats ) ;
2005-04-16 15:20:36 -07:00
static DEFINE_RWLOCK ( binfmt_lock ) ;
2012-03-17 03:05:16 -04:00
void __register_binfmt ( struct linux_binfmt * fmt , int insert )
2005-04-16 15:20:36 -07:00
{
write_lock ( & binfmt_lock ) ;
2009-04-30 15:08:49 -07:00
insert ? list_add ( & fmt - > lh , & formats ) :
list_add_tail ( & fmt - > lh , & formats ) ;
2005-04-16 15:20:36 -07:00
write_unlock ( & binfmt_lock ) ;
}
2009-04-30 15:08:49 -07:00
EXPORT_SYMBOL ( __register_binfmt ) ;
2005-04-16 15:20:36 -07:00
2007-10-16 23:26:04 -07:00
void unregister_binfmt ( struct linux_binfmt * fmt )
2005-04-16 15:20:36 -07:00
{
write_lock ( & binfmt_lock ) ;
2007-10-16 23:26:03 -07:00
list_del ( & fmt - > lh ) ;
2005-04-16 15:20:36 -07:00
write_unlock ( & binfmt_lock ) ;
}
EXPORT_SYMBOL ( unregister_binfmt ) ;
static inline void put_binfmt ( struct linux_binfmt * fmt )
{
module_put ( fmt - > module ) ;
}
2015-06-29 14:42:03 -05:00
bool path_noexec ( const struct path * path )
{
return ( path - > mnt - > mnt_flags & MNT_NOEXEC ) | |
( path - > mnt - > mnt_sb - > s_iflags & SB_I_NOEXEC ) ;
}
2014-04-03 14:48:27 -07:00
# ifdef CONFIG_USELIB
2005-04-16 15:20:36 -07:00
/*
* Note that a shared library must be both readable and executable due to
* security reasons .
*
2022-02-11 08:09:40 -08:00
* Also note that we take the address to load from the file itself .
2005-04-16 15:20:36 -07:00
*/
2009-01-14 14:14:29 +01:00
SYSCALL_DEFINE1 ( uselib , const char __user * , library )
2005-04-16 15:20:36 -07:00
{
2013-09-22 16:27:52 -04:00
struct linux_binfmt * fmt ;
2008-07-26 03:33:14 -04:00
struct file * file ;
2012-10-10 15:25:28 -04:00
struct filename * tmp = getname ( library ) ;
2008-07-26 03:33:14 -04:00
int error = PTR_ERR ( tmp ) ;
2011-02-23 17:44:09 -05:00
static const struct open_flags uselib_flags = {
. open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC ,
2015-12-26 22:33:24 -05:00
. acc_mode = MAY_READ | MAY_EXEC ,
2013-06-11 08:23:01 +04:00
. intent = LOOKUP_OPEN ,
. lookup_flags = LOOKUP_FOLLOW ,
2011-02-23 17:44:09 -05:00
} ;
2008-07-26 03:33:14 -04:00
2009-04-06 11:16:22 -04:00
if ( IS_ERR ( tmp ) )
goto out ;
2013-06-11 08:23:01 +04:00
file = do_filp_open ( AT_FDCWD , tmp , & uselib_flags ) ;
2009-04-06 11:16:22 -04:00
putname ( tmp ) ;
error = PTR_ERR ( file ) ;
if ( IS_ERR ( file ) )
2005-04-16 15:20:36 -07:00
goto out ;
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11 18:36:26 -07:00
/*
* may_open ( ) has already checked for this , so it should be
* impossible to trip now . But we need to be extra cautious
* and check again at the very end too .
*/
2020-08-11 18:36:23 -07:00
error = - EACCES ;
exec: move path_noexec() check earlier
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s. Check it earlier as well in
may_open() and collect the redundant fs/exec.c path_noexec() test under
the same robustness comment as the S_ISREG() check.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs path_noexec() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
security_file_open(f)
open()
/* old location of path_noexec() test */
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11 18:36:30 -07:00
if ( WARN_ON_ONCE ( ! S_ISREG ( file_inode ( file ) - > i_mode ) | |
path_noexec ( & file - > f_path ) ) )
2005-04-16 15:20:36 -07:00
goto exit ;
2009-12-17 21:24:21 -05:00
fsnotify_open ( file ) ;
2008-12-17 13:53:20 -05:00
2005-04-16 15:20:36 -07:00
error = - ENOEXEC ;
2013-09-22 16:27:52 -04:00
read_lock ( & binfmt_lock ) ;
list_for_each_entry ( fmt , & formats , lh ) {
if ( ! fmt - > load_shlib )
continue ;
if ( ! try_module_get ( fmt - > module ) )
continue ;
2005-04-16 15:20:36 -07:00
read_unlock ( & binfmt_lock ) ;
2013-09-22 16:27:52 -04:00
error = fmt - > load_shlib ( file ) ;
read_lock ( & binfmt_lock ) ;
put_binfmt ( fmt ) ;
if ( error ! = - ENOEXEC )
break ;
2005-04-16 15:20:36 -07:00
}
2013-09-22 16:27:52 -04:00
read_unlock ( & binfmt_lock ) ;
2009-04-06 11:16:22 -04:00
exit :
2005-04-16 15:20:36 -07:00
fput ( file ) ;
out :
return error ;
}
2014-04-03 14:48:27 -07:00
# endif /* #ifdef CONFIG_USELIB */
2005-04-16 15:20:36 -07:00
2007-07-19 01:48:16 -07:00
# ifdef CONFIG_MMU
2011-03-06 18:03:11 +01:00
/*
* The nascent bprm - > mm is not visible until exec_mmap ( ) but it can
* use a lot of memory , account these pages in current - > mm temporary
* for oom_badness ( ) - > get_mm_rss ( ) . Once exec succeeds or fails , we
* change the counter back via acct_arg_size ( 0 ) .
*/
2011-03-06 18:02:54 +01:00
static void acct_arg_size ( struct linux_binprm * bprm , unsigned long pages )
2010-11-30 20:55:34 +01:00
{
struct mm_struct * mm = current - > mm ;
long diff = ( long ) ( pages - bprm - > vma_pages ) ;
if ( ! mm | | ! diff )
return ;
bprm - > vma_pages = pages ;
add_mm_counter ( mm , MM_ANONPAGES , diff ) ;
}
2011-03-06 18:02:54 +01:00
static struct page * get_arg_page ( struct linux_binprm * bprm , unsigned long pos ,
2007-07-19 01:48:16 -07:00
int write )
{
struct page * page ;
int ret ;
2016-10-13 01:20:17 +01:00
unsigned int gup_flags = FOLL_FORCE ;
2007-07-19 01:48:16 -07:00
# ifdef CONFIG_STACK_GROWSUP
if ( write ) {
2011-05-24 17:11:44 -07:00
ret = expand_downwards ( bprm - > vma , pos ) ;
2007-07-19 01:48:16 -07:00
if ( ret < 0 )
return NULL ;
}
# endif
2016-10-13 01:20:17 +01:00
if ( write )
gup_flags | = FOLL_WRITE ;
2016-02-12 13:01:54 -08:00
/*
* We are doing an exec ( ) . ' current ' is the process
* doing the exec and bprm - > mm is the new process ' s mm .
*/
2021-09-02 14:56:46 -07:00
mmap_read_lock ( bprm - > mm ) ;
2020-08-11 18:39:01 -07:00
ret = get_user_pages_remote ( bprm - > mm , pos , 1 , gup_flags ,
2016-12-14 15:06:52 -08:00
& page , NULL , NULL ) ;
2021-09-02 14:56:46 -07:00
mmap_read_unlock ( bprm - > mm ) ;
2007-07-19 01:48:16 -07:00
if ( ret < = 0 )
return NULL ;
2019-01-03 15:28:11 -08:00
if ( write )
acct_arg_size ( bprm , vma_pages ( bprm - > vma ) ) ;
2007-07-19 01:48:16 -07:00
return page ;
}
static void put_arg_page ( struct page * page )
{
put_page ( page ) ;
}
static void free_arg_pages ( struct linux_binprm * bprm )
{
}
static void flush_arg_page ( struct linux_binprm * bprm , unsigned long pos ,
struct page * page )
{
flush_cache_page ( bprm - > vma , pos , page_to_pfn ( page ) ) ;
}
static int __bprm_mm_init ( struct linux_binprm * bprm )
{
2009-01-06 14:40:44 -08:00
int err ;
2007-07-19 01:48:16 -07:00
struct vm_area_struct * vma = NULL ;
struct mm_struct * mm = bprm - > mm ;
2018-07-21 15:24:03 -07:00
bprm - > vma = vma = vm_area_alloc ( mm ) ;
2007-07-19 01:48:16 -07:00
if ( ! vma )
2009-01-06 14:40:44 -08:00
return - ENOMEM ;
mm: fix vma_is_anonymous() false-positives
vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
VMA. This is unreliable as ->mmap may not set ->vm_ops.
False-positive vma_is_anonymous() may lead to crashes:
next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
------------[ cut here ]------------
kernel BUG at mm/memory.c:1422!
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
Call Trace:
unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
unmap_mapping_range_vma mm/memory.c:2792 [inline]
unmap_mapping_range_tree mm/memory.c:2813 [inline]
unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
unmap_mapping_range+0x48/0x60 mm/memory.c:2880
truncate_pagecache+0x54/0x90 mm/truncate.c:800
truncate_setsize+0x70/0xb0 mm/truncate.c:826
simple_setattr+0xe9/0x110 fs/libfs.c:409
notify_change+0xf13/0x10f0 fs/attr.c:335
do_truncate+0x1ac/0x2b0 fs/open.c:63
do_sys_ftruncate+0x492/0x560 fs/open.c:205
__do_sys_ftruncate fs/open.c:215 [inline]
__se_sys_ftruncate fs/open.c:213 [inline]
__x64_sys_ftruncate+0x59/0x80 fs/open.c:213
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reproducer:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE _IO('c', 100)
#define KCOV_DISABLE _IO('c', 101)
#define COVER_SIZE (1024<<10)
#define KCOV_TRACE_PC 0
#define KCOV_TRACE_CMP 1
int main(int argc, char **argv)
{
int fd;
unsigned long *cover;
system("mount -t debugfs none /sys/kernel/debug");
fd = open("/sys/kernel/debug/kcov", O_RDWR);
ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
munmap(cover, COVER_SIZE * sizeof(unsigned long));
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
ftruncate(fd, 3UL << 20);
return 0;
}
This can be fixed by assigning anonymous VMAs own vm_ops and not relying
on it being NULL.
If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 16:37:35 -07:00
vma_set_anonymous ( vma ) ;
2007-07-19 01:48:16 -07:00
2020-06-08 21:33:25 -07:00
if ( mmap_write_lock_killable ( mm ) ) {
2016-05-23 16:26:02 -07:00
err = - EINTR ;
goto err_free ;
}
2007-07-19 01:48:16 -07:00
/*
* Place the stack at the largest stack address the architecture
* supports . Later , we ' ll move this to an appropriate place . We don ' t
* use STACK_TOP because that can depend on attributes which aren ' t
* configured yet .
*/
2011-07-26 16:08:40 -07:00
BUILD_BUG_ON ( VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP ) ;
2007-07-19 01:48:16 -07:00
vma - > vm_end = STACK_TOP_MAX ;
vma - > vm_start = vma - > vm_end - PAGE_SIZE ;
2013-09-11 14:22:24 -07:00
vma - > vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP ;
2007-10-18 23:39:15 -07:00
vma - > vm_page_prot = vm_get_page_prot ( vma - > vm_flags ) ;
2010-12-09 15:29:42 +01:00
2007-07-19 01:48:16 -07:00
err = insert_vm_struct ( mm , vma ) ;
2009-01-06 14:40:44 -08:00
if ( err )
2007-07-19 01:48:16 -07:00
goto err ;
mm - > stack_vm = mm - > total_vm = 1 ;
2020-06-08 21:33:25 -07:00
mmap_write_unlock ( mm ) ;
2007-07-19 01:48:16 -07:00
bprm - > p = vma - > vm_end - sizeof ( void * ) ;
return 0 ;
err :
2020-06-08 21:33:25 -07:00
mmap_write_unlock ( mm ) ;
2016-05-23 16:26:02 -07:00
err_free :
2009-01-06 14:40:44 -08:00
bprm - > vma = NULL ;
2018-07-21 13:48:51 -07:00
vm_area_free ( vma ) ;
2007-07-19 01:48:16 -07:00
return err ;
}
static bool valid_arg_len ( struct linux_binprm * bprm , long len )
{
return len < = MAX_ARG_STRLEN ;
}
# else
2011-03-06 18:02:54 +01:00
static inline void acct_arg_size ( struct linux_binprm * bprm , unsigned long pages )
2010-11-30 20:55:34 +01:00
{
}
2011-03-06 18:02:54 +01:00
static struct page * get_arg_page ( struct linux_binprm * bprm , unsigned long pos ,
2007-07-19 01:48:16 -07:00
int write )
{
struct page * page ;
page = bprm - > page [ pos / PAGE_SIZE ] ;
if ( ! page & & write ) {
page = alloc_page ( GFP_HIGHUSER | __GFP_ZERO ) ;
if ( ! page )
return NULL ;
bprm - > page [ pos / PAGE_SIZE ] = page ;
}
return page ;
}
static void put_arg_page ( struct page * page )
{
}
static void free_arg_page ( struct linux_binprm * bprm , int i )
{
if ( bprm - > page [ i ] ) {
__free_page ( bprm - > page [ i ] ) ;
bprm - > page [ i ] = NULL ;
}
}
static void free_arg_pages ( struct linux_binprm * bprm )
{
int i ;
for ( i = 0 ; i < MAX_ARG_PAGES ; i + + )
free_arg_page ( bprm , i ) ;
}
static void flush_arg_page ( struct linux_binprm * bprm , unsigned long pos ,
struct page * page )
{
}
static int __bprm_mm_init ( struct linux_binprm * bprm )
{
bprm - > p = PAGE_SIZE * MAX_ARG_PAGES - sizeof ( void * ) ;
return 0 ;
}
static bool valid_arg_len ( struct linux_binprm * bprm , long len )
{
return len < = bprm - > p ;
}
# endif /* CONFIG_MMU */
/*
* Create a new mm_struct and populate it with a temporary stack
* vm_area_struct . We don ' t have enough context at this point to set the stack
* flags , permissions , and offset , so we use temporary values . We ' ll update
* them later in setup_arg_pages ( ) .
*/
2013-02-20 13:16:01 +11:00
static int bprm_mm_init ( struct linux_binprm * bprm )
2007-07-19 01:48:16 -07:00
{
int err ;
struct mm_struct * mm = NULL ;
bprm - > mm = mm = mm_alloc ( ) ;
err = - ENOMEM ;
if ( ! mm )
goto err ;
2018-04-10 16:35:01 -07:00
/* Save current stack limit for all calculations made during exec. */
task_lock ( current - > group_leader ) ;
bprm - > rlim_stack = current - > signal - > rlim [ RLIMIT_STACK ] ;
task_unlock ( current - > group_leader ) ;
2007-07-19 01:48:16 -07:00
err = __bprm_mm_init ( bprm ) ;
if ( err )
goto err ;
return 0 ;
err :
if ( mm ) {
bprm - > mm = NULL ;
mmdrop ( mm ) ;
}
return err ;
}
2011-03-06 18:02:37 +01:00
struct user_arg_ptr {
2011-03-06 18:02:54 +01:00
# ifdef CONFIG_COMPAT
bool is_compat ;
# endif
union {
const char __user * const __user * native ;
# ifdef CONFIG_COMPAT
2012-09-30 13:38:55 -04:00
const compat_uptr_t __user * compat ;
2011-03-06 18:02:54 +01:00
# endif
} ptr ;
2011-03-06 18:02:37 +01:00
} ;
static const char __user * get_user_arg_ptr ( struct user_arg_ptr argv , int nr )
2011-03-06 18:02:21 +01:00
{
2011-03-06 18:02:54 +01:00
const char __user * native ;
# ifdef CONFIG_COMPAT
if ( unlikely ( argv . is_compat ) ) {
compat_uptr_t compat ;
if ( get_user ( compat , argv . ptr . compat + nr ) )
return ERR_PTR ( - EFAULT ) ;
2011-03-06 18:02:21 +01:00
2011-03-06 18:02:54 +01:00
return compat_ptr ( compat ) ;
}
# endif
if ( get_user ( native , argv . ptr . native + nr ) )
2011-03-06 18:02:21 +01:00
return ERR_PTR ( - EFAULT ) ;
2011-03-06 18:02:54 +01:00
return native ;
2011-03-06 18:02:21 +01:00
}
2005-04-16 15:20:36 -07:00
/*
* count ( ) counts the number of strings in array ARGV .
*/
2011-03-06 18:02:37 +01:00
static int count ( struct user_arg_ptr argv , int max )
2005-04-16 15:20:36 -07:00
{
int i = 0 ;
2011-03-06 18:02:54 +01:00
if ( argv . ptr . native ! = NULL ) {
2005-04-16 15:20:36 -07:00
for ( ; ; ) {
2011-03-06 18:02:21 +01:00
const char __user * p = get_user_arg_ptr ( argv , i ) ;
2005-04-16 15:20:36 -07:00
if ( ! p )
break ;
2011-03-06 18:02:21 +01:00
if ( IS_ERR ( p ) )
return - EFAULT ;
2013-01-11 14:31:48 -08:00
if ( i > = max )
2005-04-16 15:20:36 -07:00
return - E2BIG ;
2013-01-11 14:31:48 -08:00
+ + i ;
2010-09-07 19:37:06 -07:00
if ( fatal_signal_pending ( current ) )
return - ERESTARTNOHAND ;
2005-04-16 15:20:36 -07:00
cond_resched ( ) ;
}
}
return i ;
}
2020-07-13 12:06:48 -05:00
static int count_strings_kernel ( const char * const * argv )
{
int i ;
if ( ! argv )
return 0 ;
for ( i = 0 ; argv [ i ] ; + + i ) {
if ( i > = MAX_ARG_STRINGS )
return - E2BIG ;
if ( fatal_signal_pending ( current ) )
return - ERESTARTNOHAND ;
cond_resched ( ) ;
}
return i ;
}
2020-07-12 08:23:54 -05:00
static int bprm_stack_limits ( struct linux_binprm * bprm )
2019-01-03 15:28:11 -08:00
{
unsigned long limit , ptr_size ;
/*
* Limit to 1 / 4 of the max stack size or 3 / 4 of _STK_LIM
* ( whichever is smaller ) for the argv + env strings .
* This ensures that :
* - the remaining binfmt code will not run out of stack space ,
* - the program will have a reasonable amount of stack left
* to work from .
*/
limit = _STK_LIM / 4 * 3 ;
limit = min ( limit , bprm - > rlim_stack . rlim_cur / 4 ) ;
/*
* We ' ve historically supported up to 32 pages ( ARG_MAX )
* of argument strings even with small stacks
*/
limit = max_t ( unsigned long , limit , ARG_MAX ) ;
/*
* We must account for the size of all the argv and envp pointers to
* the argv and envp strings , since they will also take up space in
* the stack . They aren ' t stored until much later when we can ' t
* signal to the parent that the child has run out of stack space .
* Instead , calculate it here so it ' s possible to fail gracefully .
2022-01-31 16:09:47 -08:00
*
* In the case of argc = 0 , make sure there is space for adding a
* empty string ( which will bump argc to 1 ) , to ensure confused
* userspace programs don ' t start processing from argv [ 1 ] , thinking
* argc can never be 0 , to keep them from walking envp by accident .
* See do_execveat_common ( ) .
2019-01-03 15:28:11 -08:00
*/
2022-01-31 16:09:47 -08:00
ptr_size = ( max ( bprm - > argc , 1 ) + bprm - > envc ) * sizeof ( void * ) ;
2019-01-03 15:28:11 -08:00
if ( limit < = ptr_size )
return - E2BIG ;
limit - = ptr_size ;
bprm - > argmin = bprm - > p - limit ;
return 0 ;
}
2005-04-16 15:20:36 -07:00
/*
2007-07-19 01:48:16 -07:00
* ' copy_strings ( ) ' copies argument / environment strings from the old
* processes ' s memory to the new process ' s stack . The call to get_user_pages ( )
* ensures the destination page is created and not swapped out .
2005-04-16 15:20:36 -07:00
*/
2011-03-06 18:02:37 +01:00
static int copy_strings ( int argc , struct user_arg_ptr argv ,
2005-05-05 16:16:09 -07:00
struct linux_binprm * bprm )
2005-04-16 15:20:36 -07:00
{
struct page * kmapped_page = NULL ;
char * kaddr = NULL ;
2007-07-19 01:48:16 -07:00
unsigned long kpos = 0 ;
2005-04-16 15:20:36 -07:00
int ret ;
while ( argc - - > 0 ) {
2010-08-17 23:52:56 +01:00
const char __user * str ;
2005-04-16 15:20:36 -07:00
int len ;
unsigned long pos ;
2011-03-06 18:02:21 +01:00
ret = - EFAULT ;
str = get_user_arg_ptr ( argv , argc ) ;
if ( IS_ERR ( str ) )
2005-04-16 15:20:36 -07:00
goto out ;
2011-03-06 18:02:21 +01:00
len = strnlen_user ( str , MAX_ARG_STRLEN ) ;
if ( ! len )
goto out ;
ret = - E2BIG ;
if ( ! valid_arg_len ( bprm , len ) )
2005-04-16 15:20:36 -07:00
goto out ;
2022-02-11 08:09:40 -08:00
/* We're going to work our way backwards. */
2005-04-16 15:20:36 -07:00
pos = bprm - > p ;
2007-07-19 01:48:16 -07:00
str + = len ;
bprm - > p - = len ;
2019-01-03 15:28:11 -08:00
# ifdef CONFIG_MMU
if ( bprm - > p < bprm - > argmin )
goto out ;
# endif
2005-04-16 15:20:36 -07:00
while ( len > 0 ) {
int offset , bytes_to_copy ;
2010-09-07 19:37:06 -07:00
if ( fatal_signal_pending ( current ) ) {
ret = - ERESTARTNOHAND ;
goto out ;
}
2010-09-07 19:36:28 -07:00
cond_resched ( ) ;
2005-04-16 15:20:36 -07:00
offset = pos % PAGE_SIZE ;
2007-07-19 01:48:16 -07:00
if ( offset = = 0 )
offset = PAGE_SIZE ;
bytes_to_copy = offset ;
if ( bytes_to_copy > len )
bytes_to_copy = len ;
offset - = bytes_to_copy ;
pos - = bytes_to_copy ;
str - = bytes_to_copy ;
len - = bytes_to_copy ;
if ( ! kmapped_page | | kpos ! = ( pos & PAGE_MASK ) ) {
struct page * page ;
page = get_arg_page ( bprm , pos , 1 ) ;
2005-04-16 15:20:36 -07:00
if ( ! page ) {
2007-07-19 01:48:16 -07:00
ret = - E2BIG ;
2005-04-16 15:20:36 -07:00
goto out ;
}
2007-07-19 01:48:16 -07:00
if ( kmapped_page ) {
2021-09-02 14:56:36 -07:00
flush_dcache_page ( kmapped_page ) ;
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 20:28:56 +02:00
kunmap_local ( kaddr ) ;
2007-07-19 01:48:16 -07:00
put_arg_page ( kmapped_page ) ;
}
2005-04-16 15:20:36 -07:00
kmapped_page = page ;
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 20:28:56 +02:00
kaddr = kmap_local_page ( kmapped_page ) ;
2007-07-19 01:48:16 -07:00
kpos = pos & PAGE_MASK ;
flush_arg_page ( bprm , kpos , kmapped_page ) ;
2005-04-16 15:20:36 -07:00
}
2007-07-19 01:48:16 -07:00
if ( copy_from_user ( kaddr + offset , str , bytes_to_copy ) ) {
2005-04-16 15:20:36 -07:00
ret = - EFAULT ;
goto out ;
}
}
}
ret = 0 ;
out :
2007-07-19 01:48:16 -07:00
if ( kmapped_page ) {
2021-09-02 14:56:36 -07:00
flush_dcache_page ( kmapped_page ) ;
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 20:28:56 +02:00
kunmap_local ( kaddr ) ;
2007-07-19 01:48:16 -07:00
put_arg_page ( kmapped_page ) ;
}
2005-04-16 15:20:36 -07:00
return ret ;
}
/*
2020-06-04 16:51:14 -07:00
* Copy and argument / environment string from the kernel to the processes stack .
2005-04-16 15:20:36 -07:00
*/
2020-06-04 16:51:14 -07:00
int copy_string_kernel ( const char * arg , struct linux_binprm * bprm )
2005-04-16 15:20:36 -07:00
{
2020-06-04 16:51:18 -07:00
int len = strnlen ( arg , MAX_ARG_STRLEN ) + 1 /* terminating NUL */ ;
unsigned long pos = bprm - > p ;
if ( len = = 0 )
return - EFAULT ;
if ( ! valid_arg_len ( bprm , len ) )
return - E2BIG ;
/* We're going to work our way backwards. */
arg + = len ;
bprm - > p - = len ;
if ( IS_ENABLED ( CONFIG_MMU ) & & bprm - > p < bprm - > argmin )
return - E2BIG ;
2011-03-06 18:02:37 +01:00
2020-06-04 16:51:18 -07:00
while ( len > 0 ) {
unsigned int bytes_to_copy = min_t ( unsigned int , len ,
min_not_zero ( offset_in_page ( pos ) , PAGE_SIZE ) ) ;
struct page * page ;
2011-03-06 18:02:37 +01:00
2020-06-04 16:51:18 -07:00
pos - = bytes_to_copy ;
arg - = bytes_to_copy ;
len - = bytes_to_copy ;
2011-03-06 18:02:37 +01:00
2020-06-04 16:51:18 -07:00
page = get_arg_page ( bprm , pos , 1 ) ;
if ( ! page )
return - E2BIG ;
flush_arg_page ( bprm , pos & PAGE_MASK , page ) ;
2022-07-24 23:25:23 +02:00
memcpy_to_page ( page , offset_in_page ( pos ) , arg , bytes_to_copy ) ;
2020-06-04 16:51:18 -07:00
put_arg_page ( page ) ;
}
return 0 ;
2005-04-16 15:20:36 -07:00
}
2020-06-04 16:51:14 -07:00
EXPORT_SYMBOL ( copy_string_kernel ) ;
2005-04-16 15:20:36 -07:00
2020-07-13 12:06:48 -05:00
static int copy_strings_kernel ( int argc , const char * const * argv ,
struct linux_binprm * bprm )
{
while ( argc - - > 0 ) {
int ret = copy_string_kernel ( argv [ argc ] , bprm ) ;
if ( ret < 0 )
return ret ;
if ( fatal_signal_pending ( current ) )
return - ERESTARTNOHAND ;
cond_resched ( ) ;
}
return 0 ;
}
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_MMU
2007-07-19 01:48:16 -07:00
2005-04-16 15:20:36 -07:00
/*
2007-07-19 01:48:16 -07:00
* During bprm_mm_init ( ) , we create a temporary stack at STACK_TOP_MAX . Once
* the binfmt code determines where the new stack should reside , we shift it to
* its final location . The process proceeds as follows :
2005-04-16 15:20:36 -07:00
*
2007-07-19 01:48:16 -07:00
* 1 ) Use shift to calculate the new vma endpoints .
* 2 ) Extend vma to cover both the old and new ranges . This ensures the
* arguments passed to subsequent functions are consistent .
* 3 ) Move vma ' s page tables to the new range .
* 4 ) Free up any cleared pgd range .
* 5 ) Shrink the vma to cover only the new range .
2005-04-16 15:20:36 -07:00
*/
2007-07-19 01:48:16 -07:00
static int shift_arg_pages ( struct vm_area_struct * vma , unsigned long shift )
2005-04-16 15:20:36 -07:00
{
struct mm_struct * mm = vma - > vm_mm ;
2007-07-19 01:48:16 -07:00
unsigned long old_start = vma - > vm_start ;
unsigned long old_end = vma - > vm_end ;
unsigned long length = old_end - old_start ;
unsigned long new_start = old_start - shift ;
unsigned long new_end = old_end - shift ;
2022-09-06 19:48:56 +00:00
VMA_ITERATOR ( vmi , mm , new_start ) ;
struct vm_area_struct * next ;
2011-05-24 17:11:45 -07:00
struct mmu_gather tlb ;
2005-04-16 15:20:36 -07:00
2007-07-19 01:48:16 -07:00
BUG_ON ( new_start > new_end ) ;
2005-04-16 15:20:36 -07:00
2007-07-19 01:48:16 -07:00
/*
* ensure there are no vmas between where we want to go
* and where we are
*/
2022-09-06 19:48:56 +00:00
if ( vma ! = vma_next ( & vmi ) )
2007-07-19 01:48:16 -07:00
return - EFAULT ;
/*
* cover the whole range : [ new_start , old_end )
*/
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 13:42:07 -08:00
if ( vma_adjust ( vma , new_start , old_end , vma - > vm_pgoff , NULL ) )
return - ENOMEM ;
2007-07-19 01:48:16 -07:00
/*
* move the page tables downwards , on failure we rely on
* process cleanup to remove whatever mess we made .
*/
if ( length ! = move_page_tables ( vma , old_start ,
2012-10-08 16:31:50 -07:00
vma , new_start , length , false ) )
2007-07-19 01:48:16 -07:00
return - ENOMEM ;
lru_add_drain ( ) ;
2021-01-27 23:53:45 +00:00
tlb_gather_mmu ( & tlb , mm ) ;
2022-09-06 19:48:56 +00:00
next = vma_next ( & vmi ) ;
2007-07-19 01:48:16 -07:00
if ( new_end > old_start ) {
/*
* when the old and new regions overlap clear from new_end .
*/
2011-05-24 17:11:45 -07:00
free_pgd_range ( & tlb , new_end , old_end , new_end ,
2022-09-06 19:48:56 +00:00
next ? next - > vm_start : USER_PGTABLES_CEILING ) ;
2007-07-19 01:48:16 -07:00
} else {
/*
* otherwise , clean from old_start ; this is done to not touch
* the address space in [ new_end , old_start ) some architectures
* have constraints on va - space that make this illegal ( IA64 ) -
* for the others its just a little faster .
*/
2011-05-24 17:11:45 -07:00
free_pgd_range ( & tlb , old_start , old_end , new_end ,
2022-09-06 19:48:56 +00:00
next ? next - > vm_start : USER_PGTABLES_CEILING ) ;
2005-04-16 15:20:36 -07:00
}
2021-01-27 23:53:43 +00:00
tlb_finish_mmu ( & tlb ) ;
2007-07-19 01:48:16 -07:00
/*
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 13:42:07 -08:00
* Shrink the vma to just the new range . Always succeeds .
2007-07-19 01:48:16 -07:00
*/
vma_adjust ( vma , new_start , new_end , vma - > vm_pgoff , NULL ) ;
return 0 ;
2005-04-16 15:20:36 -07:00
}
2007-07-19 01:48:16 -07:00
/*
* Finalizes the stack vm_area_struct . The flags and permissions are updated ,
* the stack is optionally relocated , and some extra space is added .
*/
2005-04-16 15:20:36 -07:00
int setup_arg_pages ( struct linux_binprm * bprm ,
unsigned long stack_top ,
int executable_stack )
{
2007-07-19 01:48:16 -07:00
unsigned long ret ;
unsigned long stack_shift ;
2005-04-16 15:20:36 -07:00
struct mm_struct * mm = current - > mm ;
2007-07-19 01:48:16 -07:00
struct vm_area_struct * vma = bprm - > vma ;
struct vm_area_struct * prev = NULL ;
unsigned long vm_flags ;
unsigned long stack_base ;
2010-02-10 13:56:42 -08:00
unsigned long stack_size ;
unsigned long stack_expand ;
unsigned long rlim_stack ;
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and do
better/fewer flushes. This would also be handy for later userfaultfd
enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
provides most of the performance benefits. Unlike previous versions,
we now only avoid flushes that would not result in spurious
page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:
mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
The test was run in KVM guest with 1 or 2 threads (the second thread was
busy-looping). I measured the time (cycles) of each operation:
1 thread 2 threads
mmots +patch mmots +patch
PROT_READ 3494 2725 (-22%) 8630 7788 (-10%)
PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%)
[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
The exact numbers are really meaningless, but the benefit is clear. There
are 2 interesting results though.
(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved
(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
This patch (of 3):
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them in finer
granularity.
The use of mmu_gather for modified PTEs has benefits in various scenarios
even if pages are not released. For instance, if only a single page needs
to be flushed out of a range of many pages, only that page would be
flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which would
Linux would actually use by default). mprotect() over multiple VMAs
requires a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released, only
record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09 18:20:50 -07:00
struct mmu_gather tlb ;
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_STACK_GROWSUP
2014-05-13 23:58:24 +01:00
/* Limit stack size */
2018-04-10 16:35:01 -07:00
stack_base = bprm - > rlim_stack . rlim_max ;
2020-11-06 19:41:36 +01:00
stack_base = calc_max_stack_size ( stack_base ) ;
2005-04-16 15:20:36 -07:00
2015-05-11 22:01:27 +02:00
/* Add space for stack randomization. */
stack_base + = ( STACK_RND_MASK < < PAGE_SHIFT ) ;
2007-07-19 01:48:16 -07:00
/* Make sure we didn't let the argument array grow too large. */
if ( vma - > vm_end - vma - > vm_start > stack_base )
return - ENOMEM ;
2005-04-16 15:20:36 -07:00
2007-07-19 01:48:16 -07:00
stack_base = PAGE_ALIGN ( stack_top - stack_base ) ;
2005-04-16 15:20:36 -07:00
2007-07-19 01:48:16 -07:00
stack_shift = vma - > vm_start - stack_base ;
mm - > arg_start = bprm - > p - stack_shift ;
bprm - > p = vma - > vm_end - stack_shift ;
2005-04-16 15:20:36 -07:00
# else
2007-07-19 01:48:16 -07:00
stack_top = arch_align_stack ( stack_top ) ;
stack_top = PAGE_ALIGN ( stack_top ) ;
2010-09-07 19:35:49 -07:00
if ( unlikely ( stack_top < mmap_min_addr ) | |
unlikely ( vma - > vm_end - vma - > vm_start > = stack_top - mmap_min_addr ) )
return - ENOMEM ;
2007-07-19 01:48:16 -07:00
stack_shift = vma - > vm_end - stack_top ;
bprm - > p - = stack_shift ;
2005-04-16 15:20:36 -07:00
mm - > arg_start = bprm - > p ;
# endif
if ( bprm - > loader )
2007-07-19 01:48:16 -07:00
bprm - > loader - = stack_shift ;
bprm - > exec - = stack_shift ;
2005-04-16 15:20:36 -07:00
2020-06-08 21:33:25 -07:00
if ( mmap_write_lock_killable ( mm ) )
2016-05-23 16:26:02 -07:00
return - EINTR ;
2008-07-10 21:19:20 +01:00
vm_flags = VM_STACK_FLAGS ;
2007-07-19 01:48:16 -07:00
/*
* Adjust stack execute permissions ; explicitly enable for
* EXSTACK_ENABLE_X , disable for EXSTACK_DISABLE_X and leave alone
* ( arch default ) otherwise .
*/
if ( unlikely ( executable_stack = = EXSTACK_ENABLE_X ) )
vm_flags | = VM_EXEC ;
else if ( executable_stack = = EXSTACK_DISABLE_X )
vm_flags & = ~ VM_EXEC ;
vm_flags | = mm - > def_flags ;
2010-05-24 14:32:24 -07:00
vm_flags | = VM_STACK_INCOMPLETE_SETUP ;
2007-07-19 01:48:16 -07:00
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and do
better/fewer flushes. This would also be handy for later userfaultfd
enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
provides most of the performance benefits. Unlike previous versions,
we now only avoid flushes that would not result in spurious
page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:
mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
The test was run in KVM guest with 1 or 2 threads (the second thread was
busy-looping). I measured the time (cycles) of each operation:
1 thread 2 threads
mmots +patch mmots +patch
PROT_READ 3494 2725 (-22%) 8630 7788 (-10%)
PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%)
[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
The exact numbers are really meaningless, but the benefit is clear. There
are 2 interesting results though.
(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved
(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
This patch (of 3):
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them in finer
granularity.
The use of mmu_gather for modified PTEs has benefits in various scenarios
even if pages are not released. For instance, if only a single page needs
to be flushed out of a range of many pages, only that page would be
flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which would
Linux would actually use by default). mprotect() over multiple VMAs
requires a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released, only
record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09 18:20:50 -07:00
tlb_gather_mmu ( & tlb , mm ) ;
ret = mprotect_fixup ( & tlb , vma , & prev , vma - > vm_start , vma - > vm_end ,
2007-07-19 01:48:16 -07:00
vm_flags ) ;
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and do
better/fewer flushes. This would also be handy for later userfaultfd
enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
provides most of the performance benefits. Unlike previous versions,
we now only avoid flushes that would not result in spurious
page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:
mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
The test was run in KVM guest with 1 or 2 threads (the second thread was
busy-looping). I measured the time (cycles) of each operation:
1 thread 2 threads
mmots +patch mmots +patch
PROT_READ 3494 2725 (-22%) 8630 7788 (-10%)
PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%)
[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
The exact numbers are really meaningless, but the benefit is clear. There
are 2 interesting results though.
(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved
(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
This patch (of 3):
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them in finer
granularity.
The use of mmu_gather for modified PTEs has benefits in various scenarios
even if pages are not released. For instance, if only a single page needs
to be flushed out of a range of many pages, only that page would be
flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which would
Linux would actually use by default). mprotect() over multiple VMAs
requires a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released, only
record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09 18:20:50 -07:00
tlb_finish_mmu ( & tlb ) ;
2007-07-19 01:48:16 -07:00
if ( ret )
goto out_unlock ;
BUG_ON ( prev ! = vma ) ;
2020-01-30 22:17:29 -08:00
if ( unlikely ( vm_flags & VM_EXEC ) ) {
pr_warn_once ( " process '%pD4' started with executable stack \n " ,
bprm - > file ) ;
}
2007-07-19 01:48:16 -07:00
/* Move stack pages down in memory. */
if ( stack_shift ) {
ret = shift_arg_pages ( vma , stack_shift ) ;
2009-11-11 14:26:48 -08:00
if ( ret )
goto out_unlock ;
2005-04-16 15:20:36 -07:00
}
2010-05-24 14:32:24 -07:00
/* mprotect_fixup is overkill to remove the temporary stack flags */
vma - > vm_flags & = ~ VM_STACK_INCOMPLETE_SETUP ;
2010-03-05 13:42:57 -08:00
stack_expand = 131072UL ; /* randomly 32*4k (or 2*64k) pages */
2010-02-10 13:56:42 -08:00
stack_size = vma - > vm_end - vma - > vm_start ;
/*
* Align this down to a page boundary as expand_stack
* will align it up .
*/
2018-04-10 16:35:01 -07:00
rlim_stack = bprm - > rlim_stack . rlim_cur & PAGE_MASK ;
2007-07-19 01:48:16 -07:00
# ifdef CONFIG_STACK_GROWSUP
2010-02-10 13:56:42 -08:00
if ( stack_size + stack_expand > rlim_stack )
stack_base = vma - > vm_start + rlim_stack ;
else
stack_base = vma - > vm_end + stack_expand ;
2007-07-19 01:48:16 -07:00
# else
2010-02-10 13:56:42 -08:00
if ( stack_size + stack_expand > rlim_stack )
stack_base = vma - > vm_end - rlim_stack ;
else
stack_base = vma - > vm_start - stack_expand ;
2007-07-19 01:48:16 -07:00
# endif
2010-05-18 15:30:49 +01:00
current - > mm - > start_stack = bprm - > p ;
2007-07-19 01:48:16 -07:00
ret = expand_stack ( vma , stack_base ) ;
if ( ret )
ret = - EFAULT ;
out_unlock :
2020-06-08 21:33:25 -07:00
mmap_write_unlock ( mm ) ;
2009-11-11 14:26:48 -08:00
return ret ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( setup_arg_pages ) ;
2016-07-24 11:30:18 -04:00
# else
/*
* Transfer the program arguments and environment from the holding pages
* onto the stack . The provided stack pointer is adjusted accordingly .
*/
int transfer_args_to_stack ( struct linux_binprm * bprm ,
unsigned long * sp_location )
{
unsigned long index , stop , sp ;
int ret = 0 ;
stop = bprm - > p > > PAGE_SHIFT ;
sp = * sp_location ;
for ( index = MAX_ARG_PAGES - 1 ; index > = stop ; index - - ) {
unsigned int offset = index = = stop ? bprm - > p & ~ PAGE_MASK : 0 ;
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 20:28:56 +02:00
char * src = kmap_local_page ( bprm - > page [ index ] ) + offset ;
2016-07-24 11:30:18 -04:00
sp - = PAGE_SIZE - offset ;
if ( copy_to_user ( ( void * ) sp , src , PAGE_SIZE - offset ) ! = 0 )
ret = - EFAULT ;
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 20:28:56 +02:00
kunmap_local ( src ) ;
2016-07-24 11:30:18 -04:00
if ( ret )
goto out ;
}
* sp_location = sp ;
out :
return ret ;
}
EXPORT_SYMBOL ( transfer_args_to_stack ) ;
2005-04-16 15:20:36 -07:00
# endif /* CONFIG_MMU */
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
static struct file * do_open_execat ( int fd , struct filename * name , int flags )
2005-04-16 15:20:36 -07:00
{
struct file * file ;
2008-05-19 07:53:34 +02:00
int err ;
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
struct open_flags open_exec_flags = {
2011-02-23 17:44:09 -05:00
. open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC ,
2015-12-26 22:33:24 -05:00
. acc_mode = MAY_EXEC ,
2013-06-11 08:23:01 +04:00
. intent = LOOKUP_OPEN ,
. lookup_flags = LOOKUP_FOLLOW ,
2011-02-23 17:44:09 -05:00
} ;
2005-04-16 15:20:36 -07:00
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
if ( ( flags & ~ ( AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH ) ) ! = 0 )
return ERR_PTR ( - EINVAL ) ;
if ( flags & AT_SYMLINK_NOFOLLOW )
open_exec_flags . lookup_flags & = ~ LOOKUP_FOLLOW ;
if ( flags & AT_EMPTY_PATH )
open_exec_flags . lookup_flags | = LOOKUP_EMPTY ;
file = do_filp_open ( fd , name , & open_exec_flags ) ;
2009-04-06 11:16:22 -04:00
if ( IS_ERR ( file ) )
2008-05-19 07:53:34 +02:00
goto out ;
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11 18:36:26 -07:00
/*
* may_open ( ) has already checked for this , so it should be
* impossible to trip now . But we need to be extra cautious
* and check again at the very end too .
*/
2008-05-19 07:53:34 +02:00
err = - EACCES ;
exec: move path_noexec() check earlier
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s. Check it earlier as well in
may_open() and collect the redundant fs/exec.c path_noexec() test under
the same robustness comment as the S_ISREG() check.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs path_noexec() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
security_file_open(f)
open()
/* old location of path_noexec() test */
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11 18:36:30 -07:00
if ( WARN_ON_ONCE ( ! S_ISREG ( file_inode ( file ) - > i_mode ) | |
path_noexec ( & file - > f_path ) ) )
2009-04-06 11:16:22 -04:00
goto exit ;
2008-05-19 07:53:34 +02:00
err = deny_write_access ( file ) ;
2009-04-06 11:16:22 -04:00
if ( err )
goto exit ;
2005-04-16 15:20:36 -07:00
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
if ( name - > name [ 0 ] ! = ' \0 ' )
fsnotify_open ( file ) ;
2009-04-06 11:16:22 -04:00
out :
2008-05-19 07:53:34 +02:00
return file ;
2009-04-06 11:16:22 -04:00
exit :
fput ( file ) ;
2008-05-19 07:53:34 +02:00
return ERR_PTR ( err ) ;
}
2014-02-05 12:54:53 -08:00
struct file * open_exec ( const char * name )
{
2015-01-22 00:00:03 -05:00
struct filename * filename = getname_kernel ( name ) ;
struct file * f = ERR_CAST ( filename ) ;
if ( ! IS_ERR ( filename ) ) {
f = do_open_execat ( AT_FDCWD , filename , 0 ) ;
putname ( filename ) ;
}
return f ;
2014-02-05 12:54:53 -08:00
}
2005-04-16 15:20:36 -07:00
EXPORT_SYMBOL ( open_exec ) ;
2022-09-26 17:15:32 -05:00
# if defined(CONFIG_BINFMT_FLAT) || defined(CONFIG_BINFMT_ELF_FDPIC)
2013-04-13 20:31:37 -04:00
ssize_t read_code ( struct file * file , unsigned long addr , loff_t pos , size_t len )
{
2014-02-04 19:08:21 -05:00
ssize_t res = vfs_read ( file , ( void __user * ) addr , len , & pos ) ;
2013-04-13 20:31:37 -04:00
if ( res > 0 )
2020-06-07 21:42:43 -07:00
flush_icache_user_range ( addr , addr + len ) ;
2013-04-13 20:31:37 -04:00
return res ;
}
EXPORT_SYMBOL ( read_code ) ;
2020-06-07 21:42:40 -07:00
# endif
2013-04-13 20:31:37 -04:00
2020-03-25 10:03:36 -05:00
/*
* Maps the mm_struct mm into the current task struct .
2020-12-03 14:12:00 -06:00
* On success , this function returns with exec_update_lock
* held for writing .
2020-03-25 10:03:36 -05:00
*/
2005-04-16 15:20:36 -07:00
static int exec_mmap ( struct mm_struct * mm )
{
struct task_struct * tsk ;
mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.
We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.
The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:
1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+
2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+
3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+
4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+
[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 15:37:25 -07:00
struct mm_struct * old_mm , * active_mm ;
2020-03-25 10:03:36 -05:00
int ret ;
2005-04-16 15:20:36 -07:00
/* Notify parent that we're no longer interested in the old VM */
tsk = current ;
old_mm = current - > mm ;
2019-11-06 22:55:38 +01:00
exec_mm_release ( tsk , old_mm ) ;
2020-03-30 16:33:39 -05:00
if ( old_mm )
sync_mm_rss ( old_mm ) ;
2005-04-16 15:20:36 -07:00
2020-12-03 14:12:00 -06:00
ret = down_write_killable ( & tsk - > signal - > exec_update_lock ) ;
2020-03-25 10:03:36 -05:00
if ( ret )
return ret ;
2005-04-16 15:20:36 -07:00
if ( old_mm ) {
/*
2021-09-03 10:26:05 -05:00
* If there is a pending fatal signal perhaps a signal
* whose default action is to create a coredump get
* out and die instead of going through with the exec .
2005-04-16 15:20:36 -07:00
*/
2021-09-03 10:26:05 -05:00
ret = mmap_read_lock_killable ( old_mm ) ;
if ( ret ) {
2020-12-03 14:12:00 -06:00
up_write ( & tsk - > signal - > exec_update_lock ) ;
2021-09-03 10:26:05 -05:00
return ret ;
2005-04-16 15:20:36 -07:00
}
}
2020-03-25 10:03:36 -05:00
2005-04-16 15:20:36 -07:00
task_lock ( tsk ) ;
2019-09-19 13:37:02 -04:00
membarrier_exec_mmap ( mm ) ;
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
2020-09-14 14:52:16 +10:00
local_irq_disable ( ) ;
active_mm = tsk - > active_mm ;
2005-04-16 15:20:36 -07:00
tsk - > active_mm = mm ;
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
2020-09-14 14:52:16 +10:00
tsk - > mm = mm ;
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated , which could cause problems for
* lazy tlb mm refcounting when these are updated by context
* switches . Not all architectures can handle irqs off over
* activate_mm yet .
*/
if ( ! IS_ENABLED ( CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM ) )
local_irq_enable ( ) ;
2005-04-16 15:20:36 -07:00
activate_mm ( active_mm , mm ) ;
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
2020-09-14 14:52:16 +10:00
if ( IS_ENABLED ( CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM ) )
local_irq_enable ( ) ;
2022-10-26 15:48:30 +02:00
lru_gen_add_mm ( mm ) ;
2005-04-16 15:20:36 -07:00
task_unlock ( tsk ) ;
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1147696.57 44640.29
patch1-8: 1245274.91 48435.66
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset
patch1-8
49.44% lzo1x_1_do_compress (real work)
6.19% page_vma_mapped_walk (overhead)
5.97% _raw_spin_unlock_irq
3.13% get_pfn_folio
2.85% ptep_clear_flush
2.42% __zram_bvec_write
2.08% do_raw_spin_lock
1.92% memmove
1.44% alloc_zspage
1.36% memset
Configurations:
no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 02:00:05 -06:00
lru_gen_use_mm ( mm ) ;
2005-04-16 15:20:36 -07:00
if ( old_mm ) {
2020-06-08 21:33:25 -07:00
mmap_read_unlock ( old_mm ) ;
2006-04-01 01:13:38 +02:00
BUG_ON ( active_mm ! = old_mm ) ;
2012-03-19 17:04:01 +01:00
setmax_mm_hiwater_rss ( & tsk - > signal - > maxrss , old_mm ) ;
mm owner: fix race between swapoff and exit
There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on. The condition occurs when
try_to_unuse() runs in parallel with an exiting task. A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.
CPU0 CPU1
try_to_unuse
looks at mm = task0->mm
increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
mmput(mm) decrements mm->mm_users
task0 freed
dereferencing mm->owner fails
The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.
Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.
Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.
Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same. And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.
Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-28 23:09:31 +01:00
mm_update_next_owner ( old_mm ) ;
2005-04-16 15:20:36 -07:00
mmput ( old_mm ) ;
return 0 ;
}
mmdrop ( active_mm ) ;
return 0 ;
}
2006-01-14 13:20:43 -08:00
static int de_thread ( struct task_struct * tsk )
2005-04-16 15:20:36 -07:00
{
struct signal_struct * sig = tsk - > signal ;
2007-10-16 23:27:22 -07:00
struct sighand_struct * oldsighand = tsk - > sighand ;
2005-04-16 15:20:36 -07:00
spinlock_t * lock = & oldsighand - > siglock ;
2006-09-27 01:51:13 -07:00
if ( thread_group_empty ( tsk ) )
2005-04-16 15:20:36 -07:00
goto no_thread_group ;
/*
* Kill all other threads in the thread group .
*/
spin_lock_irq ( lock ) ;
2021-06-24 02:14:30 -05:00
if ( ( sig - > flags & SIGNAL_GROUP_EXIT ) | | sig - > group_exec_task ) {
2005-04-16 15:20:36 -07:00
/*
* Another group action in progress , just
* return so that the signal is processed .
*/
spin_unlock_irq ( lock ) ;
return - EAGAIN ;
}
2010-05-26 14:43:11 -07:00
2021-06-06 13:47:53 -05:00
sig - > group_exec_task = tsk ;
2010-05-26 14:43:11 -07:00
sig - > notify_count = zap_other_threads ( tsk ) ;
if ( ! thread_group_leader ( tsk ) )
sig - > notify_count - - ;
2005-04-16 15:20:36 -07:00
2010-05-26 14:43:11 -07:00
while ( sig - > notify_count ) {
2012-10-08 19:13:01 +02:00
__set_current_state ( TASK_KILLABLE ) ;
2005-04-16 15:20:36 -07:00
spin_unlock_irq ( lock ) ;
2018-12-03 13:04:18 +01:00
schedule ( ) ;
2019-01-03 15:28:58 -08:00
if ( __fatal_signal_pending ( tsk ) )
2012-10-08 19:13:01 +02:00
goto killed ;
2005-04-16 15:20:36 -07:00
spin_lock_irq ( lock ) ;
}
spin_unlock_irq ( lock ) ;
/*
* At this point all other threads have exited , all we have to
* do is to wait for the thread group leader to become inactive ,
* and to assume its PID :
*/
2006-09-27 01:51:13 -07:00
if ( ! thread_group_leader ( tsk ) ) {
2008-12-01 14:18:16 -08:00
struct task_struct * leader = tsk - > group_leader ;
2007-10-16 23:27:23 -07:00
for ( ; ; ) {
2017-02-02 11:50:56 +01:00
cgroup_threadgroup_change_begin ( tsk ) ;
2007-10-16 23:27:23 -07:00
write_lock_irq ( & tasklist_lock ) ;
2015-04-16 12:48:01 -07:00
/*
* Do this under tasklist_lock to ensure that
2021-06-06 13:47:53 -05:00
* exit_notify ( ) can ' t miss - > group_exec_task
2015-04-16 12:48:01 -07:00
*/
sig - > notify_count = - 1 ;
2007-10-16 23:27:23 -07:00
if ( likely ( leader - > exit_state ) )
break ;
2012-10-08 19:13:01 +02:00
__set_current_state ( TASK_KILLABLE ) ;
2007-10-16 23:27:23 -07:00
write_unlock_irq ( & tasklist_lock ) ;
2017-02-02 11:50:56 +01:00
cgroup_threadgroup_change_end ( tsk ) ;
2018-12-03 13:04:18 +01:00
schedule ( ) ;
2019-01-03 15:28:58 -08:00
if ( __fatal_signal_pending ( tsk ) )
2012-10-08 19:13:01 +02:00
goto killed ;
2007-10-16 23:27:23 -07:00
}
2005-04-16 15:20:36 -07:00
2006-04-10 22:54:16 -07:00
/*
* The only record we have of the real - time age of a
* process , regardless of execs it ' s done , is start_time .
* All the past CPU time is accumulated in signal_struct
* from sister threads now dead . But in this non - leader
* exec , nothing survives from the original leader thread ,
* whose birth marks the true age of this process now .
* When we take on its identity by switching to its PID , we
* also take its birthdate ( always earlier than our own ) .
*/
2006-09-27 01:51:13 -07:00
tsk - > start_time = leader - > start_time ;
2019-11-07 11:07:58 +01:00
tsk - > start_boottime = leader - > start_boottime ;
2006-04-10 22:54:16 -07:00
2007-10-18 23:40:18 -07:00
BUG_ON ( ! same_thread_group ( leader , tsk ) ) ;
2005-04-16 15:20:36 -07:00
/*
* An exec ( ) starts a new thread group with the
* TGID of the previous thread group . Rehash the
* two threads with a switched PID , and release
* the former thread group leader :
*/
2006-03-28 16:11:03 -08:00
/* Become a process group leader with the old leader's pid.
2006-09-27 01:51:06 -07:00
* The old leader becomes a thread of the this thread group .
2006-03-28 16:11:03 -08:00
*/
2020-04-19 06:35:02 -05:00
exchange_tids ( tsk , leader ) ;
2017-06-04 04:32:13 -05:00
transfer_pid ( leader , tsk , PIDTYPE_TGID ) ;
2006-09-27 01:51:13 -07:00
transfer_pid ( leader , tsk , PIDTYPE_PGID ) ;
transfer_pid ( leader , tsk , PIDTYPE_SID ) ;
2009-12-17 15:27:15 -08:00
2006-09-27 01:51:13 -07:00
list_replace_rcu ( & leader - > tasks , & tsk - > tasks ) ;
2009-12-17 15:27:15 -08:00
list_replace_init ( & leader - > sibling , & tsk - > sibling ) ;
2005-04-16 15:20:36 -07:00
2006-09-27 01:51:13 -07:00
tsk - > group_leader = tsk ;
leader - > group_leader = tsk ;
2006-04-10 17:16:49 -06:00
2006-09-27 01:51:13 -07:00
tsk - > exit_signal = SIGCHLD ;
2011-06-22 23:10:26 +02:00
leader - > exit_signal = - 1 ;
2005-11-23 13:37:43 -08:00
BUG_ON ( leader - > exit_state ! = EXIT_ZOMBIE ) ;
leader - > exit_state = EXIT_DEAD ;
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
Test-case:
void *tfunc(void *arg)
{
execvp("true", NULL);
return NULL;
}
int main(void)
{
int pid;
if (fork()) {
pthread_t t;
kill(getpid(), SIGSTOP);
pthread_create(&t, NULL, tfunc, NULL);
for (;;)
pause();
}
pid = getppid();
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
while (wait(NULL) > 0)
ptrace(PTRACE_CONT, pid, 0,0);
return 0;
}
It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.
Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2011-07-21 20:00:43 +02:00
/*
* We are going to release_task ( ) - > ptrace_unlink ( ) silently ,
* the tracer can sleep in do_wait ( ) . EXIT_DEAD guarantees
2022-06-29 15:29:32 +08:00
* the tracer won ' t block again waiting for this thread .
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
Test-case:
void *tfunc(void *arg)
{
execvp("true", NULL);
return NULL;
}
int main(void)
{
int pid;
if (fork()) {
pthread_t t;
kill(getpid(), SIGSTOP);
pthread_create(&t, NULL, tfunc, NULL);
for (;;)
pause();
}
pid = getppid();
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
while (wait(NULL) > 0)
ptrace(PTRACE_CONT, pid, 0,0);
return 0;
}
It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.
Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2011-07-21 20:00:43 +02:00
*/
if ( unlikely ( leader - > ptrace ) )
__wake_up_parent ( leader , leader - > parent ) ;
2005-04-16 15:20:36 -07:00
write_unlock_irq ( & tasklist_lock ) ;
2017-02-02 11:50:56 +01:00
cgroup_threadgroup_change_end ( tsk ) ;
2008-12-01 14:18:16 -08:00
release_task ( leader ) ;
2008-02-04 22:27:24 -08:00
}
2005-04-16 15:20:36 -07:00
2021-06-06 13:47:53 -05:00
sig - > group_exec_task = NULL ;
2007-10-16 23:27:23 -07:00
sig - > notify_count = 0 ;
2005-04-16 15:20:36 -07:00
no_thread_group :
2012-03-19 17:03:22 +01:00
/* we have changed execution domain */
tsk - > exit_signal = SIGCHLD ;
2020-03-25 10:00:21 -05:00
BUG_ON ( ! thread_group_leader ( tsk ) ) ;
return 0 ;
killed :
/* protects against exit_notify() and __exit_signal() */
read_lock ( & tasklist_lock ) ;
2021-06-06 13:47:53 -05:00
sig - > group_exec_task = NULL ;
2020-03-25 10:00:21 -05:00
sig - > notify_count = 0 ;
read_unlock ( & tasklist_lock ) ;
return - EAGAIN ;
}
2020-03-08 12:04:44 -05:00
/*
* This function makes sure the current process has its own signal table ,
* so that flush_signal_handlers can later reset the handlers without
* disturbing other processes . ( Other processes might share the signal
* table via the CLONE_SIGHAND option to clone ( ) . )
*/
2020-03-25 10:00:21 -05:00
static int unshare_sighand ( struct task_struct * me )
{
struct sighand_struct * oldsighand = me - > sighand ;
2005-11-07 21:12:43 +03:00
2019-01-18 14:27:26 +02:00
if ( refcount_read ( & oldsighand - > count ) ! = 1 ) {
2007-10-16 23:27:22 -07:00
struct sighand_struct * newsighand ;
2005-04-16 15:20:36 -07:00
/*
2007-10-16 23:27:22 -07:00
* This - > sighand is shared with the CLONE_SIGHAND
* but not CLONE_THREAD task , switch to the new one .
2005-04-16 15:20:36 -07:00
*/
2007-10-16 23:27:22 -07:00
newsighand = kmem_cache_alloc ( sighand_cachep , GFP_KERNEL ) ;
if ( ! newsighand )
return - ENOMEM ;
2019-01-18 14:27:26 +02:00
refcount_set ( & newsighand - > count , 1 ) ;
2005-04-16 15:20:36 -07:00
write_lock_irq ( & tasklist_lock ) ;
spin_lock ( & oldsighand - > siglock ) ;
2021-06-07 15:54:27 +02:00
memcpy ( newsighand - > action , oldsighand - > action ,
sizeof ( newsighand - > action ) ) ;
2020-03-25 10:00:21 -05:00
rcu_assign_pointer ( me - > sighand , newsighand ) ;
2005-04-16 15:20:36 -07:00
spin_unlock ( & oldsighand - > siglock ) ;
write_unlock_irq ( & tasklist_lock ) ;
signal/timer/event: signalfd core
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:
http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor if created with the following API:
int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.
The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.
If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:
struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_fd */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};
[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 22:23:13 -07:00
__cleanup_sighand ( oldsighand ) ;
2005-04-16 15:20:36 -07:00
}
return 0 ;
}
2007-10-16 23:27:22 -07:00
2017-12-14 15:32:41 -08:00
char * __get_task_comm ( char * buf , size_t buf_size , struct task_struct * tsk )
2005-04-16 15:20:36 -07:00
{
task_lock ( tsk ) ;
2022-01-19 18:08:22 -08:00
/* Always NUL terminated and zero-padded */
strscpy_pad ( buf , tsk - > comm , buf_size ) ;
2005-04-16 15:20:36 -07:00
task_unlock ( tsk ) ;
2008-02-04 22:27:21 -08:00
return buf ;
2005-04-16 15:20:36 -07:00
}
2017-12-14 15:32:41 -08:00
EXPORT_SYMBOL_GPL ( __get_task_comm ) ;
2005-04-16 15:20:36 -07:00
2012-08-21 09:56:33 -04:00
/*
* These functions flushes out all traces of the currently running executable
* so that a new one can be started
*/
2014-05-28 11:45:04 +03:00
void __set_task_comm ( struct task_struct * tsk , const char * buf , bool exec )
2005-04-16 15:20:36 -07:00
{
task_lock ( tsk ) ;
2012-01-10 15:08:09 -08:00
trace_task_rename ( tsk , buf ) ;
2022-01-19 18:08:19 -08:00
strscpy_pad ( tsk - > comm , buf , sizeof ( tsk - > comm ) ) ;
2005-04-16 15:20:36 -07:00
task_unlock ( tsk ) ;
2014-05-28 11:45:04 +03:00
perf_event_comm ( tsk , exec ) ;
2005-04-16 15:20:36 -07:00
}
exec: Correct comments about "point of no return"
In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
the comment about the point of no return should have stayed in
flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
changes in commits c89681ed7d0e ("remove steal_locks()"), and
fd8328be874f ("sanitize handling of shared descriptor tables in failing
execve()") made it look like it meant the current->sas_ss_sp line instead.
The comment was referring to the fact that once bprm->mm is NULL, all
failures from a binfmt load_binary hook (e.g. load_elf_binary), will
get SEGV raised against current. Move this comment and expand the
explanation a bit, putting it above the assignment this time, and add
details about the true nature of "point of no return" being the call
to flush_old_exec() itself.
This also removes an erroneous commet about when credentials are being
installed. That has its own dedicated function, install_exec_creds(),
which carries a similar (and correct) comment, so remove the bogus comment
where installation is not actually happening.
Cc: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-07-18 15:25:30 -07:00
/*
* Calling this is the point of no return . None of the failures will be
* seen by userspace since either the process is already taking a fatal
* signal ( via de_thread ( ) or coredump ) , or will have SEGV raised
2020-03-19 17:16:12 -05:00
* ( after exec_mmap ( ) ) by search_binary_handler ( see below ) .
exec: Correct comments about "point of no return"
In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
the comment about the point of no return should have stayed in
flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
changes in commits c89681ed7d0e ("remove steal_locks()"), and
fd8328be874f ("sanitize handling of shared descriptor tables in failing
execve()") made it look like it meant the current->sas_ss_sp line instead.
The comment was referring to the fact that once bprm->mm is NULL, all
failures from a binfmt load_binary hook (e.g. load_elf_binary), will
get SEGV raised against current. Move this comment and expand the
explanation a bit, putting it above the assignment this time, and add
details about the true nature of "point of no return" being the call
to flush_old_exec() itself.
This also removes an erroneous commet about when credentials are being
installed. That has its own dedicated function, install_exec_creds(),
which carries a similar (and correct) comment, so remove the bogus comment
where installation is not actually happening.
Cc: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-07-18 15:25:30 -07:00
*/
2020-05-03 07:54:10 -05:00
int begin_new_exec ( struct linux_binprm * bprm )
2005-04-16 15:20:36 -07:00
{
2020-03-25 10:00:07 -05:00
struct task_struct * me = current ;
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-28 22:14:42 -08:00
int retval ;
2005-04-16 15:20:36 -07:00
2020-05-29 22:00:54 -05:00
/* Once we are committed compute the creds */
retval = bprm_creds_from_file ( bprm ) ;
if ( retval )
return retval ;
2020-04-04 12:01:37 -05:00
/*
* Ensure all future errors are fatal .
*/
bprm - > point_of_no_return = true ;
2005-04-16 15:20:36 -07:00
/*
2020-03-25 10:00:21 -05:00
* Make this the only thread in the thread group .
2005-04-16 15:20:36 -07:00
*/
2020-03-25 10:00:07 -05:00
retval = de_thread ( me ) ;
2005-04-16 15:20:36 -07:00
if ( retval )
goto out ;
2020-11-30 16:58:46 -06:00
/*
* Cancel any io_uring activity across execve
*/
io_uring_task_cancel ( ) ;
2020-11-20 17:14:18 -06:00
/* Ensure the files table is not shared. */
2020-11-20 17:14:19 -06:00
retval = unshare_files ( ) ;
2020-11-20 17:14:18 -06:00
if ( retval )
goto out ;
2015-04-16 12:47:59 -07:00
/*
* Must be called _before_ exec_mmap ( ) as bprm - > mm is
2022-02-11 08:09:40 -08:00
* not visible until then . This also enables the update
2015-04-16 12:47:59 -07:00
* to be lockless .
*/
2021-04-23 10:29:59 +02:00
retval = set_mm_exe_file ( bprm - > mm , bprm - > file ) ;
if ( retval )
goto out ;
2015-04-16 12:47:59 -07:00
2020-05-14 15:17:40 -05:00
/* If the binary is not readable then enforce mm->dumpable=0 */
2020-05-16 16:29:20 -05:00
would_dump ( bprm , bprm - > file ) ;
2020-05-14 15:17:40 -05:00
if ( bprm - > have_execfd )
would_dump ( bprm , bprm - > executable ) ;
2020-05-16 16:29:20 -05:00
2005-04-16 15:20:36 -07:00
/*
* Release all of the old mmap stuff
*/
2010-11-30 20:55:34 +01:00
acct_arg_size ( bprm , 0 ) ;
2005-04-16 15:20:36 -07:00
retval = exec_mmap ( bprm - > mm ) ;
if ( retval )
2008-04-22 05:11:59 -04:00
goto out ;
2005-04-16 15:20:36 -07:00
exec: Correct comments about "point of no return"
In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
the comment about the point of no return should have stayed in
flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
changes in commits c89681ed7d0e ("remove steal_locks()"), and
fd8328be874f ("sanitize handling of shared descriptor tables in failing
execve()") made it look like it meant the current->sas_ss_sp line instead.
The comment was referring to the fact that once bprm->mm is NULL, all
failures from a binfmt load_binary hook (e.g. load_elf_binary), will
get SEGV raised against current. Move this comment and expand the
explanation a bit, putting it above the assignment this time, and add
details about the true nature of "point of no return" being the call
to flush_old_exec() itself.
This also removes an erroneous commet about when credentials are being
installed. That has its own dedicated function, install_exec_creds(),
which carries a similar (and correct) comment, so remove the bogus comment
where installation is not actually happening.
Cc: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-07-18 15:25:30 -07:00
bprm - > mm = NULL ;
2010-02-02 12:37:44 -08:00
2020-03-25 10:03:21 -05:00
# ifdef CONFIG_POSIX_TIMERS
2022-08-09 14:07:51 -03:00
spin_lock_irq ( & me - > sighand - > siglock ) ;
posix_cpu_timers_exit ( me ) ;
spin_unlock_irq ( & me - > sighand - > siglock ) ;
2022-07-11 18:16:25 +02:00
exit_itimers ( me ) ;
2020-03-25 10:03:21 -05:00
flush_itimer_signals ( ) ;
# endif
/*
* Make the signal table private .
*/
retval = unshare_sighand ( me ) ;
if ( retval )
2020-04-02 18:04:54 -05:00
goto out_unlock ;
2020-03-25 10:03:21 -05:00
2022-04-11 14:15:54 -05:00
me - > flags & = ~ ( PF_RANDOMIZE | PF_FORKNOEXEC |
2014-01-23 15:55:57 -08:00
PF_NOFREEZE | PF_NO_SETAFFINITY ) ;
2010-02-02 12:37:44 -08:00
flush_thread ( ) ;
2020-03-25 10:00:07 -05:00
me - > personality & = ~ bprm - > per_clear ;
2010-02-02 12:37:44 -08:00
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector. This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-11-27 14:32:34 -05:00
clear_syscall_work_syscall_user_dispatch ( me ) ;
2016-12-21 16:26:24 +11:00
/*
* We have to apply CLOEXEC before we change whether the process is
* dumpable ( in setup_new_exec ) to avoid a race with a process in userspace
* trying to access the should - be - closed file descriptors of a process
* undergoing exec ( 2 ) .
*/
2020-03-25 10:00:07 -05:00
do_close_on_exec ( me - > files ) ;
2016-11-16 22:06:51 -06:00
2017-07-18 15:25:35 -07:00
if ( bprm - > secureexec ) {
2017-07-18 15:25:36 -07:00
/* Make sure parent cannot signal privileged process. */
2020-04-02 18:35:14 -05:00
me - > pdeath_signal = 0 ;
2017-07-18 15:25:36 -07:00
2017-07-18 15:25:35 -07:00
/*
* For secureexec , reset the stack limit to sane default to
* avoid bad behavior from the prior rlimits . This has to
* happen before arch_pick_mmap_layout ( ) , which examines
* RLIMIT_STACK , but after the point of no return to avoid
2017-12-12 11:28:38 -08:00
* needing to clean up the change on failure .
2017-07-18 15:25:35 -07:00
*/
2018-04-10 16:35:01 -07:00
if ( bprm - > rlim_stack . rlim_cur > _STK_LIM )
bprm - > rlim_stack . rlim_cur = _STK_LIM ;
2017-07-18 15:25:35 -07:00
}
2020-04-02 18:35:14 -05:00
me - > sas_ss_sp = me - > sas_ss_size = 0 ;
2005-04-16 15:20:36 -07:00
2018-01-02 15:21:33 -08:00
/*
* Figure out dumpability . Note that this checking only of current
* is wrong , but userspace depends on it . This should be testing
* bprm - > secureexec instead .
*/
2017-07-18 15:25:34 -07:00
if ( bprm - > interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP | |
2018-01-02 15:21:33 -08:00
! ( uid_eq ( current_euid ( ) , current_uid ( ) ) & &
gid_eq ( current_egid ( ) , current_gid ( ) ) ) )
2007-07-19 01:48:27 -07:00
set_dumpable ( current - > mm , suid_dumpable ) ;
2017-07-18 15:25:34 -07:00
else
set_dumpable ( current - > mm , SUID_DUMP_USER ) ;
2005-06-23 00:09:43 -07:00
2014-05-21 17:32:19 +02:00
perf_event_exec ( ) ;
2020-04-02 18:35:14 -05:00
__set_task_comm ( me , kbasename ( bprm - > filename ) , true ) ;
2005-04-16 15:20:36 -07:00
/* An exec changes our domain. We are no longer part of the thread
group */
2020-04-02 18:35:14 -05:00
WRITE_ONCE ( me - > self_exec_id , me - > self_exec_id + 1 ) ;
flush_signal_handlers ( me , 0 ) ;
2020-05-03 06:48:17 -05:00
2021-04-22 14:27:09 +02:00
retval = set_cred_ucounts ( bprm - > cred ) ;
if ( retval < 0 )
goto out_unlock ;
2020-05-03 06:48:17 -05:00
/*
* install the new credentials for this executable
*/
security_bprm_committing_creds ( bprm ) ;
commit_creds ( bprm - > cred ) ;
bprm - > cred = NULL ;
/*
* Disable monitoring for regular users
* when executing setuid binaries . Must
* wait until new credentials are committed
* by commit_creds ( ) above
*/
2020-04-02 18:35:14 -05:00
if ( get_dumpable ( me - > mm ) ! = SUID_DUMP_USER )
perf_event_exit_task ( me ) ;
2020-05-03 06:48:17 -05:00
/*
* cred_guard_mutex must be held at least to this point to prevent
* ptrace_attach ( ) from altering our determination of the task ' s
* credentials ; any time after this it may be unlocked .
*/
security_bprm_committed_creds ( bprm ) ;
2020-05-14 15:17:40 -05:00
/* Pass the opened binary to the interpreter. */
if ( bprm - > have_execfd ) {
retval = get_unused_fd_flags ( 0 ) ;
if ( retval < 0 )
goto out_unlock ;
fd_install ( retval , bprm - > executable ) ;
bprm - > executable = NULL ;
bprm - > execfd = retval ;
}
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-28 22:14:42 -08:00
return 0 ;
2020-05-03 07:15:28 -05:00
out_unlock :
2020-12-03 14:12:00 -06:00
up_write ( & me - > signal - > exec_update_lock ) ;
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-28 22:14:42 -08:00
out :
return retval ;
}
2020-05-03 07:54:10 -05:00
EXPORT_SYMBOL ( begin_new_exec ) ;
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-28 22:14:42 -08:00
2011-06-19 12:49:47 -04:00
void would_dump ( struct linux_binprm * bprm , struct file * file )
{
2016-11-16 22:06:51 -06:00
struct inode * inode = file_inode ( file ) ;
2021-01-21 14:19:41 +01:00
struct user_namespace * mnt_userns = file_mnt_user_ns ( file ) ;
if ( inode_permission ( mnt_userns , inode , MAY_READ ) < 0 ) {
2016-11-16 22:06:51 -06:00
struct user_namespace * old , * user_ns ;
2011-06-19 12:49:47 -04:00
bprm - > interp_flags | = BINPRM_FLAGS_ENFORCE_NONDUMP ;
2016-11-16 22:06:51 -06:00
/* Ensure mm->user_ns contains the executable */
user_ns = old = bprm - > mm - > user_ns ;
while ( ( user_ns ! = & init_user_ns ) & &
2021-01-21 14:19:41 +01:00
! privileged_wrt_inode_uidgid ( user_ns , mnt_userns , inode ) )
2016-11-16 22:06:51 -06:00
user_ns = user_ns - > parent ;
if ( old ! = user_ns ) {
bprm - > mm - > user_ns = get_user_ns ( user_ns ) ;
put_user_ns ( old ) ;
}
}
2011-06-19 12:49:47 -04:00
}
EXPORT_SYMBOL ( would_dump ) ;
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-28 22:14:42 -08:00
void setup_new_exec ( struct linux_binprm * bprm )
{
2020-05-03 07:15:28 -05:00
/* Setup things that can depend upon the personality */
struct task_struct * me = current ;
2005-04-16 15:20:36 -07:00
2020-05-03 07:15:28 -05:00
arch_pick_mmap_layout ( me - > mm , & bprm - > rlim_stack ) ;
2005-06-23 00:09:43 -07:00
2017-03-20 01:16:26 -07:00
arch_setup_new_exec ( ) ;
2005-04-16 15:20:36 -07:00
2006-02-28 16:59:19 -08:00
/* Set the new mm task size. We have to do that late because it may
* depend on TIF_32BIT which is only updated in flush_thread ( ) on
* some architectures like powerpc
*/
2020-05-03 07:15:28 -05:00
me - > mm - > task_size = TASK_SIZE ;
2020-12-03 14:12:00 -06:00
up_write ( & me - > signal - > exec_update_lock ) ;
2020-04-02 18:35:14 -05:00
mutex_unlock ( & me - > signal - > cred_guard_mutex ) ;
2005-04-16 15:20:36 -07:00
}
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-28 22:14:42 -08:00
EXPORT_SYMBOL ( setup_new_exec ) ;
2005-04-16 15:20:36 -07:00
2018-04-10 16:34:57 -07:00
/* Runs immediately before start_thread() takes over. */
void finalize_exec ( struct linux_binprm * bprm )
{
2018-04-10 16:35:01 -07:00
/* Store any stack rlimit changes before starting thread. */
task_lock ( current - > group_leader ) ;
current - > signal - > rlim [ RLIMIT_STACK ] = bprm - > rlim_stack ;
task_unlock ( current - > group_leader ) ;
2018-04-10 16:34:57 -07:00
}
EXPORT_SYMBOL ( finalize_exec ) ;
2009-09-05 11:17:13 -07:00
/*
* Prepare credentials and lock - > cred_guard_mutex .
2020-05-03 06:48:17 -05:00
* setup_new_exec ( ) commits the new creds and drops the lock .
2021-02-24 12:00:48 -08:00
* Or , if exec fails before , free_bprm ( ) should release - > cred
2009-09-05 11:17:13 -07:00
* and unlock .
*/
2018-12-10 16:49:54 +09:00
static int prepare_bprm_creds ( struct linux_binprm * bprm )
2009-09-05 11:17:13 -07:00
{
2010-10-27 15:34:08 -07:00
if ( mutex_lock_interruptible ( & current - > signal - > cred_guard_mutex ) )
2009-09-05 11:17:13 -07:00
return - ERESTARTNOINTR ;
bprm - > cred = prepare_exec_creds ( ) ;
if ( likely ( bprm - > cred ) )
return 0 ;
2010-10-27 15:34:08 -07:00
mutex_unlock ( & current - > signal - > cred_guard_mutex ) ;
2009-09-05 11:17:13 -07:00
return - ENOMEM ;
}
2014-02-05 12:54:53 -08:00
static void free_bprm ( struct linux_binprm * bprm )
2009-09-05 11:17:13 -07:00
{
2020-07-10 15:54:54 -05:00
if ( bprm - > mm ) {
acct_arg_size ( bprm , 0 ) ;
mmput ( bprm - > mm ) ;
}
2009-09-05 11:17:13 -07:00
free_arg_pages ( bprm ) ;
if ( bprm - > cred ) {
2010-10-27 15:34:08 -07:00
mutex_unlock ( & current - > signal - > cred_guard_mutex ) ;
2009-09-05 11:17:13 -07:00
abort_creds ( bprm - > cred ) ;
}
2014-01-23 15:55:51 -08:00
if ( bprm - > file ) {
allow_write_access ( bprm - > file ) ;
fput ( bprm - > file ) ;
}
2020-05-14 15:17:40 -05:00
if ( bprm - > executable )
fput ( bprm - > executable ) ;
exec: do not leave bprm->interp on stack
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.
Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.
After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.
This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.
For a proof of concept, see DoTest.sh from:
http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: halfdog <me@halfdog.net>
Cc: P J P <ppandit@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 15:05:16 -08:00
/* If a binfmt changed the interp, free it. */
if ( bprm - > interp ! = bprm - > filename )
kfree ( bprm - > interp ) ;
2020-07-11 08:16:15 -05:00
kfree ( bprm - > fdpath ) ;
2009-09-05 11:17:13 -07:00
kfree ( bprm ) ;
}
2020-07-11 08:16:15 -05:00
static struct linux_binprm * alloc_bprm ( int fd , struct filename * filename )
2020-07-10 15:39:45 -05:00
{
struct linux_binprm * bprm = kzalloc ( sizeof ( * bprm ) , GFP_KERNEL ) ;
2020-07-11 08:16:15 -05:00
int retval = - ENOMEM ;
2020-07-10 15:39:45 -05:00
if ( ! bprm )
2020-07-11 08:16:15 -05:00
goto out ;
if ( fd = = AT_FDCWD | | filename - > name [ 0 ] = = ' / ' ) {
bprm - > filename = filename - > name ;
} else {
if ( filename - > name [ 0 ] = = ' \0 ' )
bprm - > fdpath = kasprintf ( GFP_KERNEL , " /dev/fd/%d " , fd ) ;
else
bprm - > fdpath = kasprintf ( GFP_KERNEL , " /dev/fd/%d/%s " ,
fd , filename - > name ) ;
if ( ! bprm - > fdpath )
goto out_free ;
bprm - > filename = bprm - > fdpath ;
}
bprm - > interp = bprm - > filename ;
2020-07-10 15:54:54 -05:00
retval = bprm_mm_init ( bprm ) ;
if ( retval )
goto out_free ;
2020-07-10 15:39:45 -05:00
return bprm ;
2020-07-11 08:16:15 -05:00
out_free :
free_bprm ( bprm ) ;
out :
return ERR_PTR ( retval ) ;
2020-07-10 15:39:45 -05:00
}
2017-10-03 16:15:42 -07:00
int bprm_change_interp ( const char * interp , struct linux_binprm * bprm )
exec: do not leave bprm->interp on stack
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.
Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.
After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.
This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.
For a proof of concept, see DoTest.sh from:
http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: halfdog <me@halfdog.net>
Cc: P J P <ppandit@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 15:05:16 -08:00
{
/* If a binfmt changed the interp, free it first. */
if ( bprm - > interp ! = bprm - > filename )
kfree ( bprm - > interp ) ;
bprm - > interp = kstrdup ( interp , GFP_KERNEL ) ;
if ( ! bprm - > interp )
return - ENOMEM ;
return 0 ;
}
EXPORT_SYMBOL ( bprm_change_interp ) ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
/*
* determine how safe it is to execute the proposed program
2010-10-27 15:34:08 -07:00
* - the caller must hold - > cred_guard_mutex to protect against
2014-06-05 00:23:17 -07:00
* PTRACE_ATTACH or seccomp thread - sync
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
*/
2014-01-23 15:55:50 -08:00
static void check_unsafe_exec ( struct linux_binprm * bprm )
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
{
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
struct task_struct * p = current , * t ;
2009-03-30 07:35:18 -04:00
unsigned n_fs ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
2017-01-23 17:26:31 +13:00
if ( p - > ptrace )
bprm - > unsafe | = LSM_UNSAFE_PTRACE ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-12 16:47:50 -05:00
/*
* This isn ' t strictly necessary , but it makes it harder for LSMs to
* mess up .
*/
2014-05-21 15:23:46 -07:00
if ( task_no_new_privs ( current ) )
Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-12 16:47:50 -05:00
bprm - > unsafe | = LSM_UNSAFE_NO_NEW_PRIVS ;
2014-01-23 15:55:49 -08:00
t = p ;
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
n_fs = 1 ;
2010-08-18 04:37:33 +10:00
spin_lock ( & p - > fs - > lock ) ;
2009-04-24 01:02:45 +02:00
rcu_read_lock ( ) ;
2014-01-23 15:55:49 -08:00
while_each_thread ( p , t ) {
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
if ( t - > fs = = p - > fs )
n_fs + + ;
}
2009-04-24 01:02:45 +02:00
rcu_read_unlock ( ) ;
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
2014-01-23 15:55:50 -08:00
if ( p - > fs - > users > n_fs )
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
bprm - > unsafe | = LSM_UNSAFE_SHARE ;
2014-01-23 15:55:50 -08:00
else
p - > fs - > in_exec = 1 ;
2010-08-18 04:37:33 +10:00
spin_unlock ( & p - > fs - > lock ) ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
}
2020-05-29 22:00:54 -05:00
static void bprm_fill_uid ( struct linux_binprm * bprm , struct file * file )
2015-04-19 02:48:39 +02:00
{
2020-05-29 22:00:54 -05:00
/* Handle suid and sgid on files */
2021-01-21 14:19:42 +01:00
struct user_namespace * mnt_userns ;
2022-08-20 11:46:10 -04:00
struct inode * inode = file_inode ( file ) ;
2015-04-19 02:48:39 +02:00
unsigned int mode ;
kuid_t uid ;
kgid_t gid ;
2020-05-29 22:00:54 -05:00
if ( ! mnt_may_suid ( file - > f_path . mnt ) )
2015-04-19 02:48:39 +02:00
return ;
if ( task_no_new_privs ( current ) )
return ;
mode = READ_ONCE ( inode - > i_mode ) ;
if ( ! ( mode & ( S_ISUID | S_ISGID ) ) )
return ;
2021-01-21 14:19:42 +01:00
mnt_userns = file_mnt_user_ns ( file ) ;
2015-04-19 02:48:39 +02:00
/* Be careful if suid/sgid is set */
2016-01-22 15:40:57 -05:00
inode_lock ( inode ) ;
2015-04-19 02:48:39 +02:00
/* reload atomically mode/uid/gid now that lock held */
mode = inode - > i_mode ;
2021-01-21 14:19:42 +01:00
uid = i_uid_into_mnt ( mnt_userns , inode ) ;
gid = i_gid_into_mnt ( mnt_userns , inode ) ;
2016-01-22 15:40:57 -05:00
inode_unlock ( inode ) ;
2015-04-19 02:48:39 +02:00
/* We ignore suid/sgid if there are no mappings for them in the ns */
if ( ! kuid_has_mapping ( bprm - > cred - > user_ns , uid ) | |
! kgid_has_mapping ( bprm - > cred - > user_ns , gid ) )
return ;
if ( mode & S_ISUID ) {
bprm - > per_clear | = PER_CLEAR_ON_SETID ;
bprm - > cred - > euid = uid ;
}
if ( ( mode & ( S_ISGID | S_IXGRP ) ) = = ( S_ISGID | S_IXGRP ) ) {
bprm - > per_clear | = PER_CLEAR_ON_SETID ;
bprm - > cred - > egid = gid ;
}
}
2020-05-29 22:00:54 -05:00
/*
* Compute brpm - > cred based upon the final binary .
*/
static int bprm_creds_from_file ( struct linux_binprm * bprm )
{
/* Compute creds based on which file? */
struct file * file = bprm - > execfd_creds ? bprm - > executable : bprm - > file ;
bprm_fill_uid ( bprm , file ) ;
return security_bprm_creds_from_file ( bprm , file ) ;
}
2014-01-23 15:55:50 -08:00
/*
* Fill the binprm structure from the inode .
2020-05-29 22:00:54 -05:00
* Read the first BINPRM_BUF_SIZE bytes
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
*
* This may be called multiple times for binary chains ( scripts for example ) .
2005-04-16 15:20:36 -07:00
*/
2020-05-13 22:25:20 -05:00
static int prepare_binprm ( struct linux_binprm * bprm )
2005-04-16 15:20:36 -07:00
{
2017-09-01 17:39:13 +02:00
loff_t pos = 0 ;
2005-04-16 15:20:36 -07:00
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
memset ( bprm - > buf , 0 , BINPRM_BUF_SIZE ) ;
2017-09-01 17:39:13 +02:00
return kernel_read ( bprm - > file , bprm - > buf , BINPRM_BUF_SIZE , & pos ) ;
2005-04-16 15:20:36 -07:00
}
2007-05-08 00:25:16 -07:00
/*
* Arguments are ' \0 ' separated strings found at the location bprm - > p
* points to ; chop off the first by relocating brpm - > p to right after
* the first ' \0 ' encountered .
*/
2007-07-19 01:48:16 -07:00
int remove_arg_zero ( struct linux_binprm * bprm )
2005-04-16 15:20:36 -07:00
{
2007-07-19 01:48:16 -07:00
int ret = 0 ;
unsigned long offset ;
char * kaddr ;
struct page * page ;
2007-05-08 00:25:16 -07:00
2007-07-19 01:48:16 -07:00
if ( ! bprm - > argc )
return 0 ;
2005-04-16 15:20:36 -07:00
2007-07-19 01:48:16 -07:00
do {
offset = bprm - > p & ~ PAGE_MASK ;
page = get_arg_page ( bprm , bprm - > p , 0 ) ;
if ( ! page ) {
ret = - EFAULT ;
goto out ;
}
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 20:28:56 +02:00
kaddr = kmap_local_page ( page ) ;
2007-05-08 00:25:16 -07:00
2007-07-19 01:48:16 -07:00
for ( ; offset < PAGE_SIZE & & kaddr [ offset ] ;
offset + + , bprm - > p + + )
;
2007-05-08 00:25:16 -07:00
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 20:28:56 +02:00
kunmap_local ( kaddr ) ;
2007-07-19 01:48:16 -07:00
put_arg_page ( page ) ;
} while ( offset = = PAGE_SIZE ) ;
2007-05-08 00:25:16 -07:00
2007-07-19 01:48:16 -07:00
bprm - > p + + ;
bprm - > argc - - ;
ret = 0 ;
2007-05-08 00:25:16 -07:00
2007-07-19 01:48:16 -07:00
out :
return ret ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( remove_arg_zero ) ;
2013-09-11 14:24:44 -07:00
# define printable(c) (((c)=='\t') || ((c)=='\n') || (0x20<=(c) && (c)<=0x7e))
2005-04-16 15:20:36 -07:00
/*
* cycle the list of binary formats handler , until one recognizes the image
*/
2020-05-18 18:43:20 -05:00
static int search_binary_handler ( struct linux_binprm * bprm )
2005-04-16 15:20:36 -07:00
{
2013-09-11 14:24:44 -07:00
bool need_retry = IS_ENABLED ( CONFIG_MODULES ) ;
2005-04-16 15:20:36 -07:00
struct linux_binfmt * fmt ;
2013-09-11 14:24:44 -07:00
int retval ;
2005-04-16 15:20:36 -07:00
2020-05-13 22:25:20 -05:00
retval = prepare_binprm ( bprm ) ;
if ( retval < 0 )
return retval ;
2012-12-17 16:03:20 -08:00
2005-04-16 15:20:36 -07:00
retval = security_bprm_check ( bprm ) ;
if ( retval )
return retval ;
retval = - ENOENT ;
2013-09-11 14:24:44 -07:00
retry :
read_lock ( & binfmt_lock ) ;
list_for_each_entry ( fmt , & formats , lh ) {
if ( ! try_module_get ( fmt - > module ) )
continue ;
read_unlock ( & binfmt_lock ) ;
2019-05-14 15:44:37 -07:00
2013-09-11 14:24:44 -07:00
retval = fmt - > load_binary ( bprm ) ;
2019-05-14 15:44:37 -07:00
2014-05-04 20:11:36 -04:00
read_lock ( & binfmt_lock ) ;
put_binfmt ( fmt ) ;
2020-05-18 18:43:20 -05:00
if ( bprm - > point_of_no_return | | ( retval ! = - ENOEXEC ) ) {
2014-05-04 20:11:36 -04:00
read_unlock ( & binfmt_lock ) ;
2013-09-11 14:24:44 -07:00
return retval ;
2005-04-16 15:20:36 -07:00
}
}
2013-09-11 14:24:44 -07:00
read_unlock ( & binfmt_lock ) ;
2014-05-04 20:11:36 -04:00
if ( need_retry ) {
2013-09-11 14:24:44 -07:00
if ( printable ( bprm - > buf [ 0 ] ) & & printable ( bprm - > buf [ 1 ] ) & &
printable ( bprm - > buf [ 2 ] ) & & printable ( bprm - > buf [ 3 ] ) )
return retval ;
2013-09-11 14:24:45 -07:00
if ( request_module ( " binfmt-%04x " , * ( ushort * ) ( bprm - > buf + 2 ) ) < 0 )
return retval ;
2013-09-11 14:24:44 -07:00
need_retry = false ;
goto retry ;
}
2005-04-16 15:20:36 -07:00
return retval ;
}
2013-09-11 14:24:38 -07:00
static int exec_binprm ( struct linux_binprm * bprm )
{
pid_t old_pid , old_vpid ;
2020-05-18 18:43:20 -05:00
int ret , depth ;
2013-09-11 14:24:38 -07:00
/* Need to fetch pid before load_binary changes it */
old_pid = current - > pid ;
rcu_read_lock ( ) ;
old_vpid = task_pid_nr_ns ( current , task_active_pid_ns ( current - > parent ) ) ;
rcu_read_unlock ( ) ;
2020-05-18 18:43:20 -05:00
/* This allows 4 levels of binfmt rewrites before failing hard. */
for ( depth = 0 ; ; depth + + ) {
struct file * exec ;
if ( depth > 5 )
return - ELOOP ;
ret = search_binary_handler ( bprm ) ;
if ( ret < 0 )
return ret ;
if ( ! bprm - > interpreter )
break ;
exec = bprm - > file ;
bprm - > file = bprm - > interpreter ;
bprm - > interpreter = NULL ;
allow_write_access ( exec ) ;
if ( unlikely ( bprm - > have_execfd ) ) {
if ( bprm - > executable ) {
fput ( exec ) ;
return - ENOEXEC ;
}
bprm - > executable = exec ;
} else
fput ( exec ) ;
2013-09-11 14:24:38 -07:00
}
2020-05-18 18:43:20 -05:00
audit_bprm ( bprm ) ;
trace_sched_process_exec ( current , old_pid , bprm ) ;
ptrace_event ( PTRACE_EVENT_EXEC , old_vpid ) ;
proc_exec_connector ( current ) ;
return 0 ;
2013-09-11 14:24:38 -07:00
}
2005-04-16 15:20:36 -07:00
/*
* sys_execve ( ) executes a new program .
*/
2020-07-12 07:17:50 -05:00
static int bprm_execve ( struct linux_binprm * bprm ,
int fd , struct filename * filename , int flags )
2005-04-16 15:20:36 -07:00
{
2020-06-25 13:56:40 -05:00
struct file * file ;
2005-04-16 15:20:36 -07:00
int retval ;
2020-09-13 13:09:39 -06:00
2009-09-05 11:17:13 -07:00
retval = prepare_bprm_creds ( bprm ) ;
if ( retval )
2020-11-20 17:14:18 -06:00
return retval ;
2009-03-30 07:20:30 -04:00
2014-01-23 15:55:50 -08:00
check_unsafe_exec ( bprm ) ;
2009-09-05 11:17:13 -07:00
current - > in_execve = 1 ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
2020-06-25 13:56:40 -05:00
file = do_open_execat ( fd , filename , flags ) ;
2005-04-16 15:20:36 -07:00
retval = PTR_ERR ( file ) ;
if ( IS_ERR ( file ) )
2009-03-30 07:20:30 -04:00
goto out_unmark ;
2005-04-16 15:20:36 -07:00
sched_exec ( ) ;
bprm - > file = file ;
2020-07-11 08:16:15 -05:00
/*
* Record that a name derived from an O_CLOEXEC fd will be
2020-11-20 17:14:18 -06:00
* inaccessible after exec . This allows the code in exec to
* choose to fail when the executable is not mmaped into the
* interpreter and an open file descriptor is not passed to
* the interpreter . This makes for a better user experience
* than having the interpreter start and then immediately fail
* when it finds the executable is inaccessible .
2020-07-11 08:16:15 -05:00
*/
2020-12-09 15:42:57 -06:00
if ( bprm - > fdpath & & get_close_on_exec ( fd ) )
2020-07-11 08:16:15 -05:00
bprm - > interp_flags | = BINPRM_FLAGS_PATH_INACCESSIBLE ;
2005-04-16 15:20:36 -07:00
2020-03-22 15:46:24 -05:00
/* Set the unchanging part of bprm->cred */
retval = security_bprm_creds_for_exec ( bprm ) ;
if ( retval )
2005-04-16 15:20:36 -07:00
goto out ;
2013-09-11 14:24:38 -07:00
retval = exec_binprm ( bprm ) ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
if ( retval < 0 )
goto out ;
2005-04-16 15:20:36 -07:00
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
/* execve succeeded */
2009-03-30 07:20:30 -04:00
current - > fs - > in_exec = 0 ;
2009-02-05 17:18:11 +09:00
current - > in_execve = 0 ;
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 08:43:54 -04:00
rseq_execve ( current ) ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
acct_update_integrals ( current ) ;
2019-07-16 17:20:45 +02:00
task_numa_free ( current , false ) ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
return retval ;
2005-04-16 15:20:36 -07:00
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
out :
2020-04-04 09:42:56 -05:00
/*
2021-02-24 12:00:48 -08:00
* If past the point of no return ensure the code never
2020-04-04 09:42:56 -05:00
* returns to the userspace process . Use an existing fatal
* signal if present otherwise terminate the process with
* SIGSEGV .
*/
if ( bprm - > point_of_no_return & & ! fatal_signal_pending ( current ) )
2021-10-25 10:50:57 -05:00
force_fatal_sig ( SIGSEGV ) ;
2005-04-16 15:20:36 -07:00
2009-03-30 07:20:30 -04:00
out_unmark :
2014-01-23 15:55:50 -08:00
current - > fs - > in_exec = 0 ;
2009-02-05 17:18:11 +09:00
current - > in_execve = 0 ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
2020-07-12 07:17:50 -05:00
return retval ;
}
static int do_execveat_common ( int fd , struct filename * filename ,
struct user_arg_ptr argv ,
struct user_arg_ptr envp ,
int flags )
{
struct linux_binprm * bprm ;
int retval ;
if ( IS_ERR ( filename ) )
return PTR_ERR ( filename ) ;
/*
* We move the actual failure in case of RLIMIT_NPROC excess from
* set * uid ( ) to execve ( ) because too many poorly written programs
* don ' t check setuid ( ) return code . Here we additionally recheck
* whether NPROC limit is still exceeded .
*/
if ( ( current - > flags & PF_NPROC_EXCEEDED ) & &
2022-05-18 19:17:30 +02:00
is_rlimit_overlimit ( current_ucounts ( ) , UCOUNT_RLIMIT_NPROC , rlimit ( RLIMIT_NPROC ) ) ) {
2020-07-12 07:17:50 -05:00
retval = - EAGAIN ;
goto out_ret ;
}
/* We're below the limit (still or again), so we don't want to make
* further execve ( ) calls fail . */
current - > flags & = ~ PF_NPROC_EXCEEDED ;
bprm = alloc_bprm ( fd , filename ) ;
if ( IS_ERR ( bprm ) ) {
retval = PTR_ERR ( bprm ) ;
goto out_ret ;
}
2020-07-12 08:23:54 -05:00
retval = count ( argv , MAX_ARG_STRINGS ) ;
2022-01-31 16:09:47 -08:00
if ( retval = = 0 )
pr_warn_once ( " process '%s' launched '%s' with NULL argv: empty string added \n " ,
current - > comm , bprm - > filename ) ;
2020-07-12 08:23:54 -05:00
if ( retval < 0 )
goto out_free ;
bprm - > argc = retval ;
retval = count ( envp , MAX_ARG_STRINGS ) ;
if ( retval < 0 )
goto out_free ;
bprm - > envc = retval ;
retval = bprm_stack_limits ( bprm ) ;
2020-07-12 07:17:50 -05:00
if ( retval < 0 )
goto out_free ;
retval = copy_string_kernel ( bprm - > filename , bprm ) ;
if ( retval < 0 )
goto out_free ;
bprm - > exec = bprm - > p ;
retval = copy_strings ( bprm - > envc , envp , bprm ) ;
if ( retval < 0 )
goto out_free ;
retval = copy_strings ( bprm - > argc , argv , bprm ) ;
if ( retval < 0 )
goto out_free ;
2022-01-31 16:09:47 -08:00
/*
* When argv is empty , add an empty string ( " " ) as argv [ 0 ] to
* ensure confused userspace programs that start processing
* from argv [ 1 ] won ' t end up walking envp . See also
* bprm_stack_limits ( ) .
*/
if ( bprm - > argc = = 0 ) {
retval = copy_string_kernel ( " " , bprm ) ;
if ( retval < 0 )
goto out_free ;
bprm - > argc = 1 ;
}
2020-07-12 07:17:50 -05:00
retval = bprm_execve ( bprm , fd , filename , flags ) ;
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:24 +11:00
out_free :
2008-05-10 16:38:25 -04:00
free_bprm ( bprm ) ;
2005-04-16 15:20:36 -07:00
out_ret :
2020-06-25 13:56:40 -05:00
putname ( filename ) ;
2005-04-16 15:20:36 -07:00
return retval ;
}
2020-07-13 12:06:48 -05:00
int kernel_execve ( const char * kernel_filename ,
const char * const * argv , const char * const * envp )
{
struct filename * filename ;
struct linux_binprm * bprm ;
int fd = AT_FDCWD ;
int retval ;
2022-04-11 14:15:54 -05:00
/* It is non-sense for kernel threads to call execve */
if ( WARN_ON_ONCE ( current - > flags & PF_KTHREAD ) )
2022-04-11 11:40:14 -05:00
return - EINVAL ;
2020-07-13 12:06:48 -05:00
filename = getname_kernel ( kernel_filename ) ;
if ( IS_ERR ( filename ) )
return PTR_ERR ( filename ) ;
bprm = alloc_bprm ( fd , filename ) ;
if ( IS_ERR ( bprm ) ) {
retval = PTR_ERR ( bprm ) ;
goto out_ret ;
}
retval = count_strings_kernel ( argv ) ;
2022-01-31 16:09:47 -08:00
if ( WARN_ON_ONCE ( retval = = 0 ) )
retval = - EINVAL ;
2020-07-13 12:06:48 -05:00
if ( retval < 0 )
goto out_free ;
bprm - > argc = retval ;
retval = count_strings_kernel ( envp ) ;
if ( retval < 0 )
goto out_free ;
bprm - > envc = retval ;
retval = bprm_stack_limits ( bprm ) ;
if ( retval < 0 )
goto out_free ;
retval = copy_string_kernel ( bprm - > filename , bprm ) ;
if ( retval < 0 )
goto out_free ;
bprm - > exec = bprm - > p ;
retval = copy_strings_kernel ( bprm - > envc , envp , bprm ) ;
if ( retval < 0 )
goto out_free ;
retval = copy_strings_kernel ( bprm - > argc , argv , bprm ) ;
if ( retval < 0 )
goto out_free ;
retval = bprm_execve ( bprm , fd , filename , 0 ) ;
out_free :
free_bprm ( bprm ) ;
out_ret :
putname ( filename ) ;
return retval ;
}
static int do_execve ( struct filename * filename ,
2011-03-06 18:02:37 +01:00
const char __user * const __user * __argv ,
2012-10-20 21:49:33 -04:00
const char __user * const __user * __envp )
2011-03-06 18:02:37 +01:00
{
2011-03-06 18:02:54 +01:00
struct user_arg_ptr argv = { . ptr . native = __argv } ;
struct user_arg_ptr envp = { . ptr . native = __envp } ;
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
return do_execveat_common ( AT_FDCWD , filename , argv , envp , 0 ) ;
}
2020-07-13 12:06:48 -05:00
static int do_execveat ( int fd , struct filename * filename ,
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
const char __user * const __user * __argv ,
const char __user * const __user * __envp ,
int flags )
{
struct user_arg_ptr argv = { . ptr . native = __argv } ;
struct user_arg_ptr envp = { . ptr . native = __envp } ;
return do_execveat_common ( fd , filename , argv , envp , flags ) ;
2011-03-06 18:02:54 +01:00
}
# ifdef CONFIG_COMPAT
2014-02-05 12:54:53 -08:00
static int compat_do_execve ( struct filename * filename ,
2012-09-30 13:38:55 -04:00
const compat_uptr_t __user * __argv ,
2012-10-20 21:46:25 -04:00
const compat_uptr_t __user * __envp )
2011-03-06 18:02:54 +01:00
{
struct user_arg_ptr argv = {
. is_compat = true ,
. ptr . compat = __argv ,
} ;
struct user_arg_ptr envp = {
. is_compat = true ,
. ptr . compat = __envp ,
} ;
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
return do_execveat_common ( AT_FDCWD , filename , argv , envp , 0 ) ;
}
static int compat_do_execveat ( int fd , struct filename * filename ,
const compat_uptr_t __user * __argv ,
const compat_uptr_t __user * __envp ,
int flags )
{
struct user_arg_ptr argv = {
. is_compat = true ,
. ptr . compat = __argv ,
} ;
struct user_arg_ptr envp = {
. is_compat = true ,
. ptr . compat = __envp ,
} ;
return do_execveat_common ( fd , filename , argv , envp , flags ) ;
2011-03-06 18:02:37 +01:00
}
2011-03-06 18:02:54 +01:00
# endif
2011-03-06 18:02:37 +01:00
2009-09-23 15:56:59 -07:00
void set_binfmt ( struct linux_binfmt * new )
2005-04-16 15:20:36 -07:00
{
2009-09-23 15:57:41 -07:00
struct mm_struct * mm = current - > mm ;
if ( mm - > binfmt )
module_put ( mm - > binfmt - > module ) ;
2005-04-16 15:20:36 -07:00
2009-09-23 15:57:41 -07:00
mm - > binfmt = new ;
2009-09-23 15:56:59 -07:00
if ( new )
__module_get ( new - > module ) ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( set_binfmt ) ;
2007-07-19 01:48:27 -07:00
/*
2014-01-23 15:55:32 -08:00
* set_dumpable stores three - value SUID_DUMP_ * into mm - > flags .
2007-07-19 01:48:27 -07:00
*/
void set_dumpable ( struct mm_struct * mm , int value )
{
2014-01-23 15:55:32 -08:00
if ( WARN_ON ( ( unsigned ) value > SUID_DUMP_ROOT ) )
return ;
2019-03-07 16:29:23 -08:00
set_mask_bits ( & mm - > flags , MMF_DUMPABLE_MASK , value ) ;
2007-07-19 01:48:27 -07:00
}
2012-09-30 13:38:55 -04:00
SYSCALL_DEFINE3 ( execve ,
const char __user * , filename ,
const char __user * const __user * , argv ,
const char __user * const __user * , envp )
{
2014-02-05 12:54:53 -08:00
return do_execve ( getname ( filename ) , argv , envp ) ;
2012-09-30 13:38:55 -04:00
}
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
SYSCALL_DEFINE5 ( execveat ,
int , fd , const char __user * , filename ,
const char __user * const __user * , argv ,
const char __user * const __user * , envp ,
int , flags )
{
return do_execveat ( fd ,
2021-07-08 13:34:42 +07:00
getname_uflags ( filename , flags ) ,
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
argv , envp , flags ) ;
}
2012-09-30 13:38:55 -04:00
# ifdef CONFIG_COMPAT
2014-03-04 10:53:50 +01:00
COMPAT_SYSCALL_DEFINE3 ( execve , const char __user * , filename ,
const compat_uptr_t __user * , argv ,
const compat_uptr_t __user * , envp )
2012-09-30 13:38:55 -04:00
{
2014-02-05 12:54:53 -08:00
return compat_do_execve ( getname ( filename ) , argv , envp ) ;
2012-09-30 13:38:55 -04:00
}
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
COMPAT_SYSCALL_DEFINE5 ( execveat , int , fd ,
const char __user * , filename ,
const compat_uptr_t __user * , argv ,
const compat_uptr_t __user * , envp ,
int , flags )
{
return compat_do_execveat ( fd ,
2021-07-08 13:34:42 +07:00
getname_uflags ( filename , flags ) ,
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12 16:57:29 -08:00
argv , envp , flags ) ;
}
2012-09-30 13:38:55 -04:00
# endif
2022-01-21 22:13:17 -08:00
# ifdef CONFIG_SYSCTL
static int proc_dointvec_minmax_coredump ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp , loff_t * ppos )
{
int error = proc_dointvec_minmax ( table , write , buffer , lenp , ppos ) ;
if ( ! error )
validate_coredump_safety ( ) ;
return error ;
}
static struct ctl_table fs_exec_sysctls [ ] = {
{
. procname = " suid_dumpable " ,
. data = & suid_dumpable ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax_coredump ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_TWO ,
} ,
{ }
} ;
static int __init init_fs_exec_sysctls ( void )
{
register_sysctl_init ( " fs " , fs_exec_sysctls ) ;
return 0 ;
}
fs_initcall ( init_fs_exec_sysctls ) ;
# endif /* CONFIG_SYSCTL */