License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
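As a rough illustration of the first heuristic above (files in which
neither scanner found any license traces), the default identifier depends
only on whether the path is a uapi one. The C sketch below is not the
tooling used for this patch; the function name default_spdx_id() and the
plain substring test for "/uapi/" are assumptions made for illustration.

```c
#include <string.h>

/*
 * Hypothetical helper: pick the default SPDX identifier for a file in
 * which neither scanner found any license traces.
 */
static const char *default_spdx_id(const char *path)
{
	/* Assumption: a uapi path is approximated by a "/uapi/" substring. */
	if (strstr(path, "/uapi/"))
		return "GPL-2.0 WITH Linux-syscall-note";

	/* Everything else falls back to the top level COPYING license. */
	return "GPL-2.0";
}
```

For example, default_spdx_id("fs/proc/base.c") yields "GPL-2.0", which
matches the tag this patch adds to that file.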
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
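The distinction between header and source .c files can be sketched as
follows. This is not the script described above, only a minimal C
illustration: the helper name spdx_tag_line() is hypothetical, and the
block-comment form for headers is assumed from the kernel's documented
license-rules convention rather than taken from this commit message.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for a file, using a
 * C++-style comment for .c sources and a block comment for .h headers.
 * Returns the number of characters written, or -1 for other file types.
 */
static int spdx_tag_line(const char *path, const char *id,
			 char *out, size_t outlen)
{
	const char *ext = strrchr(path, '.');

	if (!ext)
		return -1;
	if (strcmp(ext, ".c") == 0)
		return snprintf(out, outlen,
				"// SPDX-License-Identifier: %s", id);
	if (strcmp(ext, ".h") == 0)
		return snprintf(out, outlen,
				"/* SPDX-License-Identifier: %s */", id);
	return -1;
}
```

Other file types (Makefiles, assembly, scripts) would need their own
comment syntax and are deliberately left out of this sketch.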
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
// SPDX-License-Identifier: GPL-2.0
2005-04-17 02:20:36 +04:00
/*
 *  linux/fs/proc/base.c
 *
 *  Copyright (C) 1991, 1992 Linus Torvalds
 *
 *  proc base directory handling functions
 *
 *  1999, Al Viro. Rewritten. Now it covers the whole per-process part.
 *  Instead of using magical inumbers to determine the kind of object
 *  we allocate and fill in-core inodes upon lookup. They don't even
 *  go into icache. We cache the reference to task_struct upon lookup too.
 *  Eventually it should become a filesystem in its own. We don't use the
 *  rest of procfs anymore.
2005-09-04 02:55:10 +04:00
 *
 *
 *  Changelog:
 *  17-Jan-2005
 *  Allan Bezerra
 *  Bruna Moreira <bruna.moreira@indt.org.br>
 *  Edjard Mota <edjard.mota@indt.org.br>
 *  Ilias Biris <ilias.biris@indt.org.br>
 *  Mauricio Lin <mauricio.lin@indt.org.br>
 *
 *  Embedded Linux Lab - 10LE Instituto Nokia de Tecnologia - INdT
 *
 *  A new process specific entry (smaps) included in /proc. It shows the
 *  size of rss for each memory area. The maps entry lacks information
 *  about physical memory size (rss) for each mapped file, i.e.,
 *  rss information for executables and library files.
 *  This additional information is useful for any tools that need to know
 *  about physical memory consumption for a process specific library.
 *
 *  Changelog:
 *  21-Feb-2005
 *  Embedded Linux Lab - 10LE Instituto Nokia de Tecnologia - INdT
 *  Pud inclusion in the page table walking.
 *
 *  ChangeLog:
 *  10-Mar-2005
 *  10LE Instituto Nokia de Tecnologia - INdT:
 *  A better way to walks through the page table as suggested by Hugh Dickins.
 *
 *  Simo Piiroinen <simo.piiroinen@nokia.com>:
 *  Smaps information related to shared, private, clean and dirty pages.
 *
 *  Paul Mundt <paul.mundt@nokia.com>:
 *  Overall revision about smaps.
2005-04-17 02:20:36 +04:00
 */
2016-12-24 22:46:01 +03:00
# include <linux/uaccess.h>
2005-04-17 02:20:36 +04:00
# include <linux/errno.h>
# include <linux/time.h>
# include <linux/proc_fs.h>
# include <linux/stat.h>
2008-07-27 19:29:15 +04:00
# include <linux/task_io_accounting_ops.h>
2005-04-17 02:20:36 +04:00
# include <linux/init.h>
2006-01-11 23:17:46 +03:00
# include <linux/capability.h>
2005-04-17 02:20:36 +04:00
# include <linux/file.h>
2008-04-24 15:44:08 +04:00
# include <linux/fdtable.h>
2019-03-12 09:31:18 +03:00
# include <linux/generic-radix-tree.h>
2005-04-17 02:20:36 +04:00
# include <linux/string.h>
# include <linux/seq_file.h>
# include <linux/namei.h>
2006-12-08 13:37:56 +03:00
# include <linux/mnt_namespace.h>
2005-04-17 02:20:36 +04:00
# include <linux/mm.h>
oom: badness heuristic rewrite
This is a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
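The proportional baseline can be illustrated with a small standalone C
sketch. It is a simplification under stated assumptions, not the kernel's
badness() implementation: the name badness_points() and its page-count
parameters are hypothetical, and the saturation at 1000 is an assumption.

```c
/*
 * Simplified sketch of the proportional baseline: memory in use (rss plus
 * swap) scaled to 0..1000 against the "allowable" memory. All names are
 * hypothetical.
 */
static unsigned long badness_points(unsigned long rss_pages,
				    unsigned long swap_pages,
				    unsigned long allowable_pages)
{
	unsigned long long used = (unsigned long long)rss_pages + swap_pages;

	if (!allowable_pages)
		return 0;
	/* Assumption: usage at or above the allowable amount saturates. */
	if (used >= allowable_pages)
		return 1000;
	return (unsigned long)(used * 1000 / allowable_pages);
}
```

With 400 pages resident, 100 pages in swap and 1000 allowable pages, the
sketch returns 500, i.e. roughly 50% of allowable memory.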
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
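A minimal C sketch of how such an adjustment combines with the baseline
score follows. The helper name adjusted_badness() is hypothetical, and
clamping the result back into the 0..1000 range is an assumption made for
illustration rather than something stated in this message.

```c
/*
 * Hypothetical helper: add the per-task adjustment directly to the
 * baseline score, then clamp to the 0..1000 scale (assumed).
 */
static long adjusted_badness(long points, long oom_score_adj)
{
	points += oom_score_adj;
	if (points < 0)
		return 0;
	if (points > 1000)
		return 1000;
	return points;
}
```

A task scoring 700 with an adjustment of -500 ends up at 200, so it
competes as if it used 50% less of the allowable memory.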
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
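The linear rescaling from the old units to the new ones might look like
the C sketch below. It assumes the legacy oom_adj range of -17 (disable)
to +15; the constant values and the function name oom_adj_to_score_adj()
are assumptions for illustration and are not spelled out in this message.

```c
/* Assumed legacy limits for /proc/pid/oom_adj. */
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
/* Upper bound of /proc/pid/oom_score_adj as described above. */
#define OOM_SCORE_ADJ_MAX	1000

/* Hypothetical helper: rescale an oom_adj value into oom_score_adj units. */
static int oom_adj_to_score_adj(int oom_adj)
{
	/* Map the top of the old range exactly onto the top of the new one. */
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

Because the old range has far fewer steps than the new one, several
oom_score_adj values correspond to the same oom_adj value, which is the
difference in granularity mentioned above.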
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10 04:19:46 +04:00
# include <linux/swap.h>
2005-09-10 00:04:14 +04:00
# include <linux/rcupdate.h>
2021-09-30 01:02:13 +03:00
# include <linux/kallsyms.h>
2008-11-10 11:26:08 +03:00
# include <linux/stacktrace.h>
2007-10-19 10:40:37 +04:00
# include <linux/resource.h>
2007-05-08 11:26:04 +04:00
# include <linux/module.h>
2005-04-17 02:20:36 +04:00
# include <linux/mount.h>
# include <linux/security.h>
# include <linux/ptrace.h>
2013-02-28 05:03:16 +04:00
# include <linux/printk.h>
2018-02-07 02:37:24 +03:00
# include <linux/cache.h>
2007-10-19 10:39:35 +04:00
# include <linux/cgroup.h>
2005-04-17 02:20:36 +04:00
# include <linux/cpuset.h>
# include <linux/audit.h>
2005-11-08 01:15:49 +03:00
# include <linux/poll.h>
2006-10-02 13:18:08 +04:00
# include <linux/nsproxy.h>
2006-10-20 10:28:32 +04:00
# include <linux/oom.h>
2007-07-19 12:48:28 +04:00
# include <linux/elf.h>
2007-10-19 10:40:03 +04:00
# include <linux/pid_namespace.h>
2011-11-17 12:11:58 +04:00
# include <linux/user_namespace.h>
2009-03-30 03:50:06 +04:00
# include <linux/fs_struct.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
# include <linux/slab.h>
2017-02-08 20:51:29 +03:00
# include <linux/sched/autogroup.h>
2017-02-08 20:51:29 +03:00
# include <linux/sched/mm.h>
2017-02-08 20:51:30 +03:00
# include <linux/sched/coredump.h>
2017-02-08 20:51:35 +03:00
# include <linux/sched/debug.h>
2017-02-05 14:07:04 +03:00
# include <linux/sched/stat.h>
2013-03-11 13:12:45 +04:00
# include <linux/posix-timers.h>
2019-11-12 04:27:16 +03:00
# include <linux/time_namespace.h>
2020-01-15 12:28:51 +03:00
# include <linux/resctrl.h>
2021-09-08 05:57:35 +03:00
# include <linux/cn_proc.h>
2023-04-18 08:13:41 +03:00
# include <linux/ksm.h>
2012-01-11 03:08:09 +04:00
# include <trace/events/oom.h>
2005-04-17 02:20:36 +04:00
# include "internal.h"
2012-08-23 14:43:24 +04:00
# include "fd.h"
2005-04-17 02:20:36 +04:00
2018-02-07 02:36:59 +03:00
# include "../../lib/kstrtox.h"
2006-06-26 11:25:46 +04:00
/* NOTE:
 *	Implementing inode permission operations in /proc is almost
 *	certainly an error. Permission checks need to happen during
 *	each system call not at open time. The reason is that most of
 *	what we wish to check for permissions in /proc varies at runtime.
 *
 *	The classic example of a problem is opening file descriptors
 *	in /proc for a task before it execs a suid executable.
 */
2018-02-07 02:37:24 +03:00
static u8 nlink_tid __ro_after_init;
static u8 nlink_tgid __ro_after_init;
2016-12-13 03:45:32 +03:00
2005-04-17 02:20:36 +04:00
struct pid_entry {
2014-08-09 01:21:33 +04:00
	const char *name;
2016-12-13 03:45:08 +03:00
	unsigned int len;
2011-07-24 11:36:29 +04:00
	umode_t mode;
2007-02-12 11:55:40 +03:00
	const struct inode_operations *iop;
2007-02-12 11:55:34 +03:00
	const struct file_operations *fop;
2006-10-02 13:17:07 +04:00
	union proc_op op;
2005-04-17 02:20:36 +04:00
};
2006-10-02 13:18:49 +04:00
#define NOD(NAME, MODE, IOP, FOP, OP) { \
2006-10-02 13:17:07 +04:00
	.name = (NAME), \
2007-05-08 11:26:15 +04:00
	.len = sizeof(NAME) - 1, \
2006-10-02 13:17:07 +04:00
	.mode = MODE, \
	.iop = IOP, \
	.fop = FOP, \
	.op = OP, \
}
2008-11-10 01:32:52 +03:00
#define DIR(NAME, MODE, iops, fops) \
	NOD(NAME, (S_IFDIR | (MODE)), &iops, &fops, {})
#define LNK(NAME, get_link) \
2006-10-02 13:18:49 +04:00
	NOD(NAME, (S_IFLNK | S_IRWXUGO), \
2006-10-02 13:17:07 +04:00
		&proc_pid_link_inode_operations, NULL, \
2008-11-10 01:32:52 +03:00
		{ .proc_get_link = get_link })
#define REG(NAME, MODE, fops) \
	NOD(NAME, (S_IFREG | (MODE)), NULL, &fops, {})
#define ONE(NAME, MODE, show) \
2018-09-22 03:16:59 +03:00
	NOD(NAME, (S_IFREG | (MODE)), \
2008-02-08 15:18:30 +03:00
		NULL, &proc_single_file_operations, \
2008-11-10 01:32:52 +03:00
		{ .proc_show = show })
2018-09-22 03:16:59 +03:00
#define ATTR(LSM, NAME, MODE) \
	NOD(NAME, (S_IFREG | (MODE)), \
		NULL, &proc_pid_attr_operations, \
		{ .lsm = LSM })
2005-04-17 02:20:36 +04:00
2008-06-06 09:46:53 +04:00
/*
 * Count the number of hardlinks for the pid_entry table, excluding the .
 * and .. links.
 */
2016-12-13 03:45:32 +03:00
static unsigned int __init pid_entry_nlink(const struct pid_entry *entries,
2008-06-06 09:46:53 +04:00
	unsigned int n)
{
	unsigned int i;
	unsigned int count;
2016-12-13 03:45:32 +03:00
	count = 2;
2008-06-06 09:46:53 +04:00
	for (i = 0; i < n; ++i) {
		if (S_ISDIR(entries[i].mode))
			++count;
	}
	return count;
}
2010-08-10 13:41:36 +04:00
static int get_task_root(struct task_struct *task, struct path *root)
2005-04-17 02:20:36 +04:00
{
2009-03-29 03:21:27 +04:00
	int result = -ENOENT;
2005-09-07 02:18:22 +04:00
	task_lock(task);
2010-08-10 13:41:36 +04:00
	if (task->fs) {
		get_fs_root(task->fs, root);
2009-03-29 03:21:27 +04:00
		result = 0;
	}
2005-09-07 02:18:22 +04:00
	task_unlock(task);
2009-03-29 03:21:27 +04:00
	return result;
2005-09-07 02:18:22 +04:00
}
2012-01-11 03:11:20 +04:00
static int proc_cwd_link(struct dentry *dentry, struct path *path)
2005-09-07 02:18:22 +04:00
{
2015-03-18 01:25:59 +03:00
	struct task_struct *task = get_proc_task(d_inode(dentry));
2005-09-07 02:18:22 +04:00
	int result = -ENOENT;
2006-06-26 11:25:55 +04:00
	if (task) {
2010-08-10 13:41:36 +04:00
		task_lock(task);
		if (task->fs) {
			get_fs_pwd(task->fs, path);
			result = 0;
		}
		task_unlock(task);
2006-06-26 11:25:55 +04:00
		put_task_struct(task);
	}
2005-04-17 02:20:36 +04:00
	return result;
}
2012-01-11 03:11:20 +04:00
static int proc_root_link(struct dentry *dentry, struct path *path)
2005-04-17 02:20:36 +04:00
{
2015-03-18 01:25:59 +03:00
	struct task_struct *task = get_proc_task(d_inode(dentry));
2005-04-17 02:20:36 +04:00
	int result = -ENOENT;
2006-06-26 11:25:55 +04:00
	if (task) {
2010-08-10 13:41:36 +04:00
		result = get_task_root(task, path);
2006-06-26 11:25:55 +04:00
		put_task_struct(task);
	}
2005-04-17 02:20:36 +04:00
	return result;
}
2019-07-14 00:27:14 +03:00
/*
 * If the user used setproctitle(), we just get the string from
 * user space at arg_start, and limit it to a maximum of one page.
 */
static ssize_t get_mm_proctitle(struct mm_struct *mm, char __user *buf,
				size_t count, unsigned long pos,
				unsigned long arg_start)
{
	char *page;
	int ret, got;

	if (pos >= PAGE_SIZE)
		return 0;

	page = (char *)__get_free_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;

	ret = 0;
	got = access_remote_vm(mm, arg_start, page, PAGE_SIZE, FOLL_ANON);
	if (got > 0) {
		int len = strnlen(page, got);

		/* Include the NUL character if it was found */
		if (len < got)
			len++;

		if (len > pos) {
			len -= pos;
			if (len > count)
				len = count;
			len -= copy_to_user(buf, page + pos, len);
			if (!len)
				len = -EFAULT;
			ret = len;
		}
	}
	free_page((unsigned long)page);
	return ret;
}
2018-05-17 23:04:17 +03:00
static ssize_t get_mm_cmdline(struct mm_struct *mm, char __user *buf,
2018-05-18 01:17:33 +03:00
			      size_t count, loff_t *ppos)
2005-04-17 02:20:36 +04:00
{
2015-06-26 01:00:54 +03:00
	unsigned long arg_start, arg_end, env_start, env_end;
2018-05-18 01:17:33 +03:00
	unsigned long pos, len;
2019-07-14 00:27:14 +03:00
	char *page, c;
2015-06-26 01:00:54 +03:00
	/* Check if process spawned far enough to have cmdline. */
2018-05-17 23:04:17 +03:00
	if (!mm->env_end)
		return 0;
2015-06-26 01:00:54 +03:00
mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
mmap_sem is on the hot path of the kernel, and it is very contended, but it
is abused too. It is used to protect arg_start|end and env_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, and take mmap_sem
for read instead of write to protect against the race condition between
prctl and sys_brk which might break check_data_rlimit(). This makes prctl
more friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs to acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
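The idea of reading the four values as one consistent snapshot under a
small dedicated lock can be sketched in userspace C. This is only an
illustration: the struct and function names are hypothetical, and a C11
atomic_flag stands in for the kernel spinlock.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the four mm_struct fields. */
struct arg_env {
	atomic_flag lock;	/* plays the role of the new arg_lock */
	unsigned long arg_start, arg_end, env_start, env_end;
};

struct arg_env_snapshot {
	unsigned long arg_start, arg_end, env_start, env_end;
};

static void arg_env_init(struct arg_env *ae, unsigned long arg_start,
			 unsigned long arg_end, unsigned long env_start,
			 unsigned long env_end)
{
	atomic_flag_clear(&ae->lock);
	ae->arg_start = arg_start;
	ae->arg_end = arg_end;
	ae->env_start = env_start;
	ae->env_end = env_end;
}

/* Copy all four values inside one short critical section. */
static struct arg_env_snapshot arg_env_read(struct arg_env *ae)
{
	struct arg_env_snapshot s;

	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&ae->lock,
						 memory_order_acquire))
		;
	s.arg_start = ae->arg_start;
	s.arg_end = ae->arg_end;
	s.env_start = ae->env_start;
	s.env_end = ae->env_end;
	atomic_flag_clear_explicit(&ae->lock, memory_order_release);
	return s;
}
```

The critical section only copies four words, so readers of the snapshot
never have to wait behind unrelated VM work the way they would on a
heavily contended semaphore.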
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 03:05:28 +03:00
	spin_lock(&mm->arg_lock);
2015-06-26 01:00:54 +03:00
	arg_start = mm->arg_start;
	arg_end = mm->arg_end;
	env_start = mm->env_start;
	env_end = mm->env_end;
2018-06-08 03:05:28 +03:00
	spin_unlock(&mm->arg_lock);
2015-06-26 01:00:54 +03:00
2018-05-18 01:17:33 +03:00
	if (arg_start >= arg_end)
		return 0;
2018-06-08 03:09:59 +03:00
2014-08-09 01:21:41 +04:00
	/*
2019-07-14 00:27:14 +03:00
	 * We allow setproctitle() to overwrite the argument
	 * strings, and overflow past the original end. But
	 * only when it overflows into the environment area.
2014-08-09 01:21:41 +04:00
	 */
2019-07-14 00:27:14 +03:00
	if (env_start != arg_end || env_end < env_start)
2018-05-18 01:17:33 +03:00
		env_start = env_end = arg_end;
2019-07-14 00:27:14 +03:00
	len = env_end - arg_start;
2018-06-20 03:47:20 +03:00
2018-05-18 01:17:33 +03:00
	/* We're not going to care if "*ppos" has high bits set */
2019-07-14 00:27:14 +03:00
	pos = *ppos;
	if (pos >= len)
2018-05-18 01:17:33 +03:00
		return 0;
2019-07-14 00:27:14 +03:00
	if (count > len - pos)
		count = len - pos;
	if (!count)
		return 0;
	/*
	 * Magical special case: if the argv[] end byte is not
	 * zero, the user has overwritten it with setproctitle(3).
	 *
	 * Possible future enhancement: do this only once when
	 * pos is 0, and set a flag in the 'struct file'.
	 */
	if (access_remote_vm(mm, arg_end - 1, &c, 1, FOLL_ANON) == 1 && c)
		return get_mm_proctitle(mm, buf, count, pos, arg_start);
2017-02-25 02:00:20 +03:00
2019-07-14 00:27:14 +03:00
	/*
	 * For the non-setproctitle() case we limit things strictly
	 * to the [arg_start, arg_end[ range.
	 */
	pos += arg_start;
2019-07-13 23:40:13 +03:00
	if (pos < arg_start || pos >= arg_end)
2018-05-18 01:17:33 +03:00
		return 0;
2019-07-13 23:40:13 +03:00
	if (count > arg_end - pos)
		count = arg_end - pos;
2018-05-18 01:17:33 +03:00
	page = (char *)__get_free_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;
	len = 0;
	while (count) {
		int got;
		size_t size = min_t(size_t, PAGE_SIZE, count);
2019-07-13 23:40:13 +03:00
		got = access_remote_vm(mm, pos, page, size, FOLL_ANON);
		if (got <= 0)
2018-05-18 01:17:33 +03:00
			break;
2019-07-13 23:40:13 +03:00
		got -= copy_to_user(buf, page, got);
2018-05-18 01:17:33 +03:00
		if (unlikely(!got)) {
			if (!len)
				len = -EFAULT;
			break;
2015-06-26 01:00:54 +03:00
		}
2018-05-18 01:17:33 +03:00
		pos += got;
		buf += got;
		len += got;
		count -= got;
2015-06-26 01:00:54 +03:00
	}
	free_page((unsigned long)page);
2018-05-18 01:17:33 +03:00
	return len;
2005-04-17 02:20:36 +04:00
}
2018-05-17 23:04:17 +03:00
static ssize_t get_task_cmdline(struct task_struct *tsk, char __user *buf,
				size_t count, loff_t *pos)
{
	struct mm_struct *mm;
	ssize_t ret;
	mm = get_task_mm(tsk);
	if (!mm)
		return 0;
	ret = get_mm_cmdline(mm, buf, count, pos);
2015-06-26 01:00:54 +03:00
	mmput(mm);
2018-05-17 23:04:17 +03:00
	return ret;
}
static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
				     size_t count, loff_t *pos)
{
	struct task_struct *tsk;
	ssize_t ret;
	BUG_ON(*pos < 0);
	tsk = get_proc_task(file_inode(file));
	if (!tsk)
		return -ESRCH;
	ret = get_task_cmdline(tsk, buf, count, pos);
	put_task_struct(tsk);
	if (ret > 0)
		*pos += ret;
	return ret;
2005-04-17 02:20:36 +04:00
}
2015-06-26 01:00:54 +03:00
static const struct file_operations proc_pid_cmdline_ops = {
. read = proc_pid_cmdline_read ,
. llseek = generic_file_llseek ,
} ;
2005-04-17 02:20:36 +04:00
# ifdef CONFIG_KALLSYMS
/*
* Provides a wchan file via kallsyms in a proper one - value - per - file format .
* Returns the resolved symbol . If that fails , simply return the address .
*/
2014-08-09 01:21:44 +04:00
static int proc_pid_wchan ( struct seq_file * m , struct pid_namespace * ns ,
struct pid * pid , struct task_struct * task )
2005-04-17 02:20:36 +04:00
{
2007-05-08 11:28:41 +04:00
unsigned long wchan ;
2021-09-30 01:02:13 +03:00
char symname [ KSYM_NAME_LEN ] ;
2005-04-17 02:20:36 +04:00
2021-09-30 01:02:13 +03:00
if ( ! ptrace_may_access ( task , PTRACE_MODE_READ_FSCREDS ) )
goto print0 ;
2005-04-17 02:20:36 +04:00
2021-09-30 01:02:13 +03:00
wchan = get_wchan ( task ) ;
if ( wchan & & ! lookup_symbol_name ( wchan , symname ) ) {
seq_puts ( m , symname ) ;
return 0 ;
}
2015-04-16 02:18:17 +03:00
2021-09-30 01:02:13 +03:00
print0 :
seq_putc ( m , ' 0 ' ) ;
2015-04-16 02:18:17 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
# endif /* CONFIG_KALLSYMS */
2011-03-23 22:52:50 +03:00
static int lock_trace ( struct task_struct * task )
{
2020-12-03 23:12:00 +03:00
int err = down_read_killable ( & task - > signal - > exec_update_lock ) ;
2011-03-23 22:52:50 +03:00
if ( err )
return err ;
ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 02:00:04 +03:00
if ( ! ptrace_may_access ( task , PTRACE_MODE_ATTACH_FSCREDS ) ) {
2020-12-03 23:12:00 +03:00
up_read ( & task - > signal - > exec_update_lock ) ;
2011-03-23 22:52:50 +03:00
return - EPERM ;
}
return 0 ;
}
static void unlock_trace ( struct task_struct * task )
{
2020-12-03 23:12:00 +03:00
up_read ( & task - > signal - > exec_update_lock ) ;
2011-03-23 22:52:50 +03:00
}
2008-11-10 11:26:08 +03:00
# ifdef CONFIG_STACKTRACE
# define MAX_STACK_TRACE_DEPTH 64
static int proc_pid_stack ( struct seq_file * m , struct pid_namespace * ns ,
struct pid * pid , struct task_struct * task )
{
unsigned long * entries ;
2011-03-23 22:52:50 +03:00
int err ;
2008-11-10 11:26:08 +03:00
2018-10-06 01:51:58 +03:00
/*
* The ability to racily run the kernel stack unwinder on a running task
* and then observe the unwinder output is scary ; while it is useful for
* debugging kernel issues , it can also allow an attacker to leak kernel
* stack contents .
* Doing this in a manner that is at least safe from races would require
* some work to ensure that the remote task can not be scheduled ; and
* even then , this would still expose the unwinder as local attack
* surface .
* Therefore , this interface is restricted to root .
*/
if ( ! file_ns_capable ( m - > file , & init_user_ns , CAP_SYS_ADMIN ) )
return - EACCES ;
treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:
kmalloc(a * b, gfp)
with:
kmalloc_array(a, b, gfp)
as well as handling cases of:
kmalloc(a * b * c, gfp)
with:
kmalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kmalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kmalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 23:55:00 +03:00
entries = kmalloc_array ( MAX_STACK_TRACE_DEPTH , sizeof ( * entries ) ,
GFP_KERNEL ) ;
2008-11-10 11:26:08 +03:00
if ( ! entries )
return - ENOMEM ;
2011-03-23 22:52:50 +03:00
err = lock_trace ( task ) ;
if ( ! err ) {
2019-04-25 12:44:58 +03:00
unsigned int i , nr_entries ;
2018-06-08 03:10:17 +03:00
2019-04-25 12:44:58 +03:00
nr_entries = stack_trace_save_tsk ( task , entries ,
MAX_STACK_TRACE_DEPTH , 0 ) ;
2011-03-23 22:52:50 +03:00
2019-04-25 12:44:58 +03:00
for ( i = 0 ; i < nr_entries ; i + + ) {
2017-11-28 03:45:56 +03:00
seq_printf ( m , " [<0>] %pB \n " , ( void * ) entries [ i ] ) ;
2011-03-23 22:52:50 +03:00
}
2019-04-25 12:44:58 +03:00
2011-03-23 22:52:50 +03:00
unlock_trace ( task ) ;
2008-11-10 11:26:08 +03:00
}
kfree ( entries ) ;
2011-03-23 22:52:50 +03:00
return err ;
2008-11-10 11:26:08 +03:00
}
# endif
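The point of the kmalloc() to kmalloc_array() conversion used in proc_pid_stack() above is that the two-factor form can refuse a request whose product would overflow. A minimal user-space sketch of that idea follows; `alloc_array_sketch()` is a hypothetical helper built on malloc(), not the kernel implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical user-space analogue of a two-factor allocator:
 * return NULL instead of allocating when n * size would wrap
 * around SIZE_MAX.
 */
static void *alloc_array_sketch(size_t n, size_t size)
{
    /* n * size overflows exactly when n > SIZE_MAX / size. */
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    return malloc(n * size);
}
```

With a single-argument allocator, `SIZE_MAX / 2 + 1` elements of size 2 would silently wrap to a zero-byte request; the two-factor form rejects it.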
2015-06-30 12:06:03 +03:00
# ifdef CONFIG_SCHED_INFO
2005-04-17 02:20:36 +04:00
/*
* Provides / proc / PID / schedstat
*/
2014-08-09 01:21:46 +04:00
static int proc_pid_schedstat ( struct seq_file * m , struct pid_namespace * ns ,
struct pid * pid , struct task_struct * task )
2005-04-17 02:20:36 +04:00
{
2015-06-30 12:06:03 +03:00
if ( unlikely ( ! sched_info_on ( ) ) )
2019-03-06 02:50:35 +03:00
seq_puts ( m , " 0 0 0 \n " ) ;
2015-06-30 12:06:03 +03:00
else
seq_printf ( m , " %llu %llu %lu \n " ,
2015-04-16 02:18:17 +03:00
( unsigned long long ) task - > se . sum_exec_runtime ,
( unsigned long long ) task - > sched_info . run_delay ,
task - > sched_info . pcount ) ;
return 0 ;
2005-04-17 02:20:36 +04:00
}
# endif
2008-01-25 23:08:34 +03:00
# ifdef CONFIG_LATENCYTOP
static int lstats_show_proc ( struct seq_file * m , void * v )
{
int i ;
2008-02-21 03:53:29 +03:00
struct inode * inode = m - > private ;
struct task_struct * task = get_proc_task ( inode ) ;
2008-01-25 23:08:34 +03:00
2008-02-21 03:53:29 +03:00
if ( ! task )
return - ESRCH ;
seq_puts ( m , " Latency Top version : v0.1 \n " ) ;
2018-08-22 07:54:34 +03:00
for ( i = 0 ; i < LT_SAVECOUNT ; i + + ) {
2011-01-13 04:00:30 +03:00
struct latency_record * lr = & task - > latency_record [ i ] ;
if ( lr - > backtrace [ 0 ] ) {
2008-01-25 23:08:34 +03:00
int q ;
2011-01-13 04:00:30 +03:00
seq_printf ( m , " %i %li %li " ,
lr - > count , lr - > time , lr - > max ) ;
2008-01-25 23:08:34 +03:00
for ( q = 0 ; q < LT_BACKTRACEDEPTH ; q + + ) {
2011-01-13 04:00:30 +03:00
unsigned long bt = lr - > backtrace [ q ] ;
2019-04-10 13:28:08 +03:00
2011-01-13 04:00:30 +03:00
if ( ! bt )
2008-01-25 23:08:34 +03:00
break ;
2011-01-13 04:00:30 +03:00
seq_printf ( m , " %ps " , ( void * ) bt ) ;
2008-01-25 23:08:34 +03:00
}
2011-01-13 04:00:32 +03:00
seq_putc ( m , ' \n ' ) ;
2008-01-25 23:08:34 +03:00
}
}
2008-02-21 03:53:29 +03:00
put_task_struct ( task ) ;
2008-01-25 23:08:34 +03:00
return 0 ;
}
static int lstats_open ( struct inode * inode , struct file * file )
{
2008-02-21 03:53:29 +03:00
return single_open ( file , lstats_show_proc , inode ) ;
2008-02-14 21:27:00 +03:00
}
2008-01-25 23:08:34 +03:00
static ssize_t lstats_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * offs )
{
2013-01-24 02:07:38 +04:00
struct task_struct * task = get_proc_task ( file_inode ( file ) ) ;
2008-01-25 23:08:34 +03:00
2008-02-21 03:53:29 +03:00
if ( ! task )
return - ESRCH ;
2019-05-15 01:42:34 +03:00
clear_tsk_latency_tracing ( task ) ;
2008-02-21 03:53:29 +03:00
put_task_struct ( task ) ;
2008-01-25 23:08:34 +03:00
return count ;
}
static const struct file_operations proc_lstats_operations = {
. open = lstats_open ,
. read = seq_read ,
. write = lstats_write ,
. llseek = seq_lseek ,
2008-02-21 03:53:29 +03:00
. release = single_release ,
2008-01-25 23:08:34 +03:00
} ;
# endif
2014-08-09 01:21:48 +04:00
static int proc_oom_score ( struct seq_file * m , struct pid_namespace * ns ,
struct pid * pid , struct task_struct * task )
2005-04-17 02:20:36 +04:00
{
2018-12-28 11:34:29 +03:00
unsigned long totalpages = totalram_pages ( ) + total_swap_pages ;
2010-04-01 17:13:57 +04:00
unsigned long points = 0 ;
mm, oom: make the calculation of oom badness more accurate
Recently we found an issue in our production environment: when a memcg
oom is triggered, the oom killer doesn't choose the process with the largest
resident memory but chooses the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should choose the process with the largest resident memory.
Below is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
We can see that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore negative points and convert
all of these negative points to 1. Since the oom_score_adj of all the
processes in this targeted memcg has the same value, -998, the points of
these processes are all negative. As a result, the first scanned
process will be killed.
The oom_score_adj (-998) in this memcg is set by kubelet, because it is
a Guaranteed pod, which has higher priority to avoid being killed
by system oom.
To fix this issue, we should make the calculation of oom points more
accurate. We can achieve this by converting chosen_point from 'unsigned
long' to 'long'.
[cai@lca.pw: reported an issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 04:31:22 +03:00
long badness ;
badness = oom_badness ( task , totalpages ) ;
/*
* Special case OOM_SCORE_ADJ_MIN for all others scale the
* badness value into [ 0 , 2000 ] range which we have been
* exporting for a long time so userspace might depend on it .
*/
if ( badness ! = LONG_MIN )
points = ( 1000 + badness * 1000 / ( long ) totalpages ) * 2 / 3 ;
2005-04-17 02:20:36 +04:00
2015-04-16 02:18:17 +03:00
seq_printf ( m , " %lu \n " , points ) ;
return 0 ;
2005-04-17 02:20:36 +04:00
}
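The scaling in proc_oom_score() above can be checked with a few concrete values. The sketch below reproduces the arithmetic in user space; `oom_score_sketch()` is a hypothetical name, and it assumes, as the kernel comment implies, that badness lies between -totalpages and 2 * totalpages unless it is LONG_MIN.

```c
#include <limits.h>

/*
 * Hypothetical user-space copy of the scaling in proc_oom_score():
 * LONG_MIN (the OOM_SCORE_ADJ_MIN special case) maps to 0, everything
 * else is scaled from [-totalpages, 2 * totalpages] into [0, 2000].
 */
static unsigned long oom_score_sketch(long badness, unsigned long totalpages)
{
    unsigned long points = 0;

    if (badness != LONG_MIN)
        points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
    return points;
}
```

With totalpages = 1000000, a badness of -1000000 gives 0, a badness of 0 gives 666 (integer division of 2000 by 3), and a badness of 2000000 gives 2000.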
2007-10-19 10:40:37 +04:00
struct limit_names {
2014-08-09 01:21:33 +04:00
const char * name ;
const char * unit ;
2007-10-19 10:40:37 +04:00
} ;
static const struct limit_names lnames [ RLIM_NLIMITS ] = {
2009-09-23 03:45:32 +04:00
[ RLIMIT_CPU ] = { " Max cpu time " , " seconds " } ,
2007-10-19 10:40:37 +04:00
[ RLIMIT_FSIZE ] = { " Max file size " , " bytes " } ,
[ RLIMIT_DATA ] = { " Max data size " , " bytes " } ,
[ RLIMIT_STACK ] = { " Max stack size " , " bytes " } ,
[ RLIMIT_CORE ] = { " Max core file size " , " bytes " } ,
[ RLIMIT_RSS ] = { " Max resident set " , " bytes " } ,
[ RLIMIT_NPROC ] = { " Max processes " , " processes " } ,
[ RLIMIT_NOFILE ] = { " Max open files " , " files " } ,
[ RLIMIT_MEMLOCK ] = { " Max locked memory " , " bytes " } ,
[ RLIMIT_AS ] = { " Max address space " , " bytes " } ,
[ RLIMIT_LOCKS ] = { " Max file locks " , " locks " } ,
[ RLIMIT_SIGPENDING ] = { " Max pending signals " , " signals " } ,
[ RLIMIT_MSGQUEUE ] = { " Max msgqueue size " , " bytes " } ,
[ RLIMIT_NICE ] = { " Max nice priority " , NULL } ,
[ RLIMIT_RTPRIO ] = { " Max realtime priority " , NULL } ,
2008-02-24 02:23:52 +03:00
[ RLIMIT_RTTIME ] = { " Max realtime timeout " , " us " } ,
2007-10-19 10:40:37 +04:00
} ;
/* Display limits for a process */
2014-08-09 01:21:37 +04:00
static int proc_pid_limits ( struct seq_file * m , struct pid_namespace * ns ,
struct pid * pid , struct task_struct * task )
2007-10-19 10:40:37 +04:00
{
unsigned int i ;
unsigned long flags ;
struct rlimit rlim [ RLIM_NLIMITS ] ;
2008-10-05 00:51:15 +04:00
if ( ! lock_task_sighand ( task , & flags ) )
2007-10-19 10:40:37 +04:00
return 0 ;
memcpy ( rlim , task - > signal - > rlim , sizeof ( struct rlimit ) * RLIM_NLIMITS ) ;
unlock_task_sighand ( task , & flags ) ;
/*
* print the file header
*/
2019-01-04 02:26:09 +03:00
seq_puts ( m , " Limit "
" Soft Limit "
" Hard Limit "
" Units \n " ) ;
2007-10-19 10:40:37 +04:00
for ( i = 0 ; i < RLIM_NLIMITS ; i + + ) {
if ( rlim [ i ] . rlim_cur = = RLIM_INFINITY )
2014-08-09 01:21:37 +04:00
seq_printf ( m , " %-25s %-20s " ,
2015-04-16 02:18:17 +03:00
lnames [ i ] . name , " unlimited " ) ;
2007-10-19 10:40:37 +04:00
else
2014-08-09 01:21:37 +04:00
seq_printf ( m , " %-25s %-20lu " ,
2015-04-16 02:18:17 +03:00
lnames [ i ] . name , rlim [ i ] . rlim_cur ) ;
2007-10-19 10:40:37 +04:00
if ( rlim [ i ] . rlim_max = = RLIM_INFINITY )
2014-08-09 01:21:37 +04:00
seq_printf ( m , " %-20s " , " unlimited " ) ;
2007-10-19 10:40:37 +04:00
else
2014-08-09 01:21:37 +04:00
seq_printf ( m , " %-20lu " , rlim [ i ] . rlim_max ) ;
2007-10-19 10:40:37 +04:00
if ( lnames [ i ] . unit )
2014-08-09 01:21:37 +04:00
seq_printf ( m , " %-10s \n " , lnames [ i ] . unit ) ;
2007-10-19 10:40:37 +04:00
else
2014-08-09 01:21:37 +04:00
seq_putc ( m , ' \n ' ) ;
2007-10-19 10:40:37 +04:00
}
2014-08-09 01:21:37 +04:00
return 0 ;
2007-10-19 10:40:37 +04:00
}
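The column layout produced by proc_pid_limits() above can be reproduced in user space with the same field widths. This is a sketch under stated assumptions: `format_limit_line()` is a hypothetical helper, it builds the whole line with snprintf() rather than seq_printf(), and it assumes the width specifiers are `%-25s`, `%-20s`/`%-20lu`, and `%-10s` as in the code above.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: format one /proc/PID/limits-style row into buf.
 * RLIM_INFINITY is printed as "unlimited"; a NULL unit leaves the last
 * column empty. Returns what snprintf() returns for the full line.
 */
static int format_limit_line(char *buf, size_t size, const char *name,
                             const char *unit, rlim_t cur, rlim_t max)
{
    char soft[24], hard[24];

    if (cur == RLIM_INFINITY)
        snprintf(soft, sizeof(soft), "%s", "unlimited");
    else
        snprintf(soft, sizeof(soft), "%llu", (unsigned long long)cur);

    if (max == RLIM_INFINITY)
        snprintf(hard, sizeof(hard), "%s", "unlimited");
    else
        snprintf(hard, sizeof(hard), "%llu", (unsigned long long)max);

    if (unit)
        return snprintf(buf, size, "%-25s %-20s %-20s %-10s\n",
                        name, soft, hard, unit);
    return snprintf(buf, size, "%-25s %-20s %-20s \n", name, soft, hard);
}
```

A row with a unit is 79 bytes long including the newline (25 + 1 + 20 + 1 + 20 + 1 + 10 + 1), which is why the hard-limit column always starts at byte offset 47.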
2008-07-26 06:46:00 +04:00
# ifdef CONFIG_HAVE_ARCH_TRACEHOOK
2014-08-09 01:21:39 +04:00
static int proc_pid_syscall ( struct seq_file * m , struct pid_namespace * ns ,
struct pid * pid , struct task_struct * task )
2008-07-26 06:46:00 +04:00
{
2016-11-08 00:26:35 +03:00
struct syscall_info info ;
u64 * args = & info . data . args [ 0 ] ;
2015-04-16 02:18:17 +03:00
int res ;
res = lock_trace ( task ) ;
2011-03-23 22:52:50 +03:00
if ( res )
return res ;
2008-07-26 06:46:00 +04:00
2016-11-08 00:26:35 +03:00
if ( task_current_syscall ( task , & info ) )
2014-08-09 01:21:39 +04:00
seq_puts ( m , " running \n " ) ;
2016-11-08 00:26:35 +03:00
else if ( info . data . nr < 0 )
seq_printf ( m , " %d 0x%llx 0x%llx \n " ,
info . data . nr , info . sp , info . data . instruction_pointer ) ;
2011-03-23 22:52:50 +03:00
else
2014-08-09 01:21:39 +04:00
seq_printf ( m ,
2016-11-08 00:26:35 +03:00
" %d 0x%llx 0x%llx 0x%llx 0x%llx 0x%llx 0x%llx 0x%llx 0x%llx \n " ,
info . data . nr ,
2008-07-26 06:46:00 +04:00
args [ 0 ] , args [ 1 ] , args [ 2 ] , args [ 3 ] , args [ 4 ] , args [ 5 ] ,
2016-11-08 00:26:35 +03:00
info . sp , info . data . instruction_pointer ) ;
2011-03-23 22:52:50 +03:00
unlock_trace ( task ) ;
2015-04-16 02:18:17 +03:00
return 0 ;
2008-07-26 06:46:00 +04:00
}
# endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
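proc_pid_syscall() above emits one of three line shapes: the word "running", a three-field line (negative number, stack pointer, program counter) for a task blocked outside a syscall, or a nine-field line for a task inside a syscall. A reader of that file might distinguish them as sketched below; `struct syscall_view` and `parse_syscall_line()` are hypothetical names for this example, not part of any kernel or libc API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed /proc/PID/syscall line. */
struct syscall_view {
    long nr;
    unsigned long long args[6];
    unsigned long long sp, pc;
};

/*
 * Returns 0 for "running", 1 for a task blocked outside a syscall
 * (nr, sp, pc only), 2 for a task inside a syscall (nr, six args,
 * sp, pc), and -1 if the line matches none of these shapes.
 */
static int parse_syscall_line(const char *line, struct syscall_view *v)
{
    int n;

    memset(v, 0, sizeof(*v));
    if (strncmp(line, "running", 7) == 0)
        return 0;

    /* %llx accepts the optional 0x prefix printed by the kernel. */
    n = sscanf(line, "%ld %llx %llx %llx %llx %llx %llx %llx %llx",
               &v->nr, &v->args[0], &v->args[1], &v->args[2],
               &v->args[3], &v->args[4], &v->args[5], &v->sp, &v->pc);
    if (n == 9)
        return 2;
    if (n == 3) {
        /* Short form: the two hex fields are sp and pc, not arguments. */
        v->sp = v->args[0];
        v->pc = v->args[1];
        v->args[0] = v->args[1] = 0;
        return 1;
    }
    return -1;
}
```

Reading the file itself still requires passing the ptrace access check in lock_trace(), so a caller should expect the open or read to fail for tasks it may not attach to.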
2005-04-17 02:20:36 +04:00
/************************************************************************/
/* Here the fs part begins */
/************************************************************************/
/* permission checks */
2022-01-20 05:08:03 +03:00
static bool proc_fd_access_allowed ( struct inode * inode )
2005-04-17 02:20:36 +04:00
{
2006-06-26 11:25:58 +04:00
struct task_struct * task ;
2022-01-20 05:08:03 +03:00
bool allowed = false ;
2006-06-26 11:25:59 +04:00
/* Allow access to a task's file descriptors if it is us or we
* may use ptrace attach to the process and find out that
* information .
2006-06-26 11:25:58 +04:00
*/
task = get_proc_task ( inode ) ;
2006-06-26 11:25:59 +04:00
if ( task ) {
2016-01-21 02:00:04 +03:00
allowed = ptrace_may_access ( task , PTRACE_MODE_READ_FSCREDS ) ;
2006-06-26 11:25:58 +04:00
put_task_struct ( task ) ;
2006-06-26 11:25:59 +04:00
}
2006-06-26 11:25:58 +04:00
return allowed ;
2005-04-17 02:20:36 +04:00
}
2023-01-13 14:49:11 +03:00
int proc_setattr ( struct mnt_idmap * idmap , struct dentry * dentry ,
2021-01-21 16:19:43 +03:00
struct iattr * attr )
2006-07-15 23:26:45 +04:00
{
int error ;
2015-03-18 01:25:59 +03:00
struct inode * inode = d_inode ( dentry ) ;
2006-07-15 23:26:45 +04:00
if ( attr - > ia_valid & ATTR_MODE )
return - EPERM ;
2023-01-13 14:49:11 +03:00
error = setattr_prepare ( & nop_mnt_idmap , dentry , attr ) ;
2010-06-04 13:30:02 +04:00
if ( error )
return error ;
2023-01-13 14:49:11 +03:00
setattr_copy ( & nop_mnt_idmap , inode , attr ) ;
2010-06-04 13:30:02 +04:00
return 0 ;
2006-07-15 23:26:45 +04:00
}
procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.
The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:
hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.
hidepid=1 means users may not access any /proc/<pid>/ directories but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking is done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.
hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides the process's euid
and egid. It complicates an intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.
gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting
nonroot user in sudoers file or something. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystrokes
timings:
http://www.openwall.com/lists/oss-security/2011/11/05/3
hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on the assumption that leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user who started the setuid program may read the
counters.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:31 +04:00
/*
 * May current process learn task's sched/cmdline info (for hide_pid_min=1)
 * or euid/egid (for hide_pid_min=2)?
 */
proc: allow to mount many instances of proc in one pid namespace
This patch allows multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.
1) The main aim of this work is to have one supervisor for apps on
embedded systems. Right now we have some lightweight sandbox support;
however, if we create pid namespaces we have to manage all the
processes inside too, whereas our goal is to be able to run a bunch of
apps, each one inside its own mount namespace, without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.
2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, so any remount or modification of 'hidepid' will propagate
to all other procfs mounts.
This also makes it hard to support the Yama LSM in desktop and user
sessions. Yama ptrace scope, which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() check on
/proc/<pids>/ to always run inside those new procfs instances. This will
allow specifying, per user session, whether procfs should be populated
with pids that the user can ptrace or not.
By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc; they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptraceable. This change makes it possible to restrict
/proc/<pids>/, but more importantly it gives desktop users a
generic and usable way to specify which users should see all processes
and which users cannot.
Side notes:
* This covers a gap in seccomp, which is not able to parse
arguments: it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...
In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.
Selftest has been added to verify new behavior.
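
As a rough userspace illustration of the per-mount behaviour described above
(not part of the patch itself), the sketch below formats a hidepid=/gid=
option string and passes it to mount(2). The helper names build_proc_opts()
and mount_private_proc() are hypothetical, as is the choice to always pass
both options; mounting requires CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Format a procfs option string such as "hidepid=2,gid=1000".
 * Returns 0 on success, -1 if the buffer is too small.
 * (Hypothetical helper, for illustration only.)
 */
int build_proc_opts(char *buf, size_t len, int hidepid, unsigned int gid)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "hidepid=%d,gid=%u", hidepid, gid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Mount a procfs instance at 'target' with its own hidepid=/gid=
 * options. Needs CAP_SYS_ADMIN; returns the result of mount(2),
 * or -1 if the option string does not fit.
 * (Hypothetical helper, for illustration only.)
 */
int mount_private_proc(const char *target, int hidepid, unsigned int gid)
{
	char opts[64];

	if (build_proc_opts(opts, sizeof(opts), hidepid, gid) != 0)
		return -1;

	return mount("proc", target, "proc", 0, opts);
}
```

With per-mount options, calling mount_private_proc() on two different
directories with different hidepid values would be expected to give two
instances that enforce different visibility rules.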
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-19 17:10:52 +03:00
static bool has_pid_permissions(struct proc_fs_info *fs_info,
2012-01-11 03:11:31 +04:00
				 struct task_struct *task,
2020-04-19 17:10:57 +03:00
				 enum proc_hidepid hide_pid_min)
2012-01-11 03:11:31 +04:00
{
2020-04-19 17:10:53 +03:00
	/*
	 * If 'hidepid' mount option is set force a ptrace check,
	 * we indicate that we are using a filesystem syscall
	 * by passing PTRACE_MODE_READ_FSCREDS
	 */
	if (fs_info->hide_pid == HIDEPID_NOT_PTRACEABLE)
		return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
2020-04-19 17:10:52 +03:00
	if (fs_info->hide_pid < hide_pid_min)
2012-01-11 03:11:31 +04:00
		return true;
2020-04-19 17:10:52 +03:00
	if (in_group_p(fs_info->pid_gid))
2012-01-11 03:11:31 +04:00
		return true;
ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 02:00:04 +03:00
	return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
2012-01-11 03:11:31 +04:00
}
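
The order of the checks in has_pid_permissions() can be modelled in plain
userspace C, which may help when reasoning about which combinations of
hidepid=, gid= and ptrace access grant visibility. This is only a sketch:
the enum, struct and function names below are hypothetical, the two boolean
fields stand in for in_group_p() and ptrace_may_access(), and the numeric
values are assumed to mirror the kernel's enum proc_hidepid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for enum proc_hidepid (values are an assumption). */
enum model_hidepid {
	MODEL_HIDEPID_OFF = 0,
	MODEL_HIDEPID_NO_ACCESS = 1,
	MODEL_HIDEPID_INVISIBLE = 2,
	MODEL_HIDEPID_NOT_PTRACEABLE = 4,
};

/* Inputs the real function obtains from the mount and the caller's creds. */
struct model_access {
	enum model_hidepid hide_pid; /* hidepid= value of this procfs mount */
	bool caller_in_gid;          /* stands in for in_group_p(pid_gid) */
	bool ptrace_allowed;         /* stands in for ptrace_may_access() */
};

/* Same decision order as the kernel function shown above. */
bool model_has_pid_permissions(const struct model_access *m,
			       enum model_hidepid hide_pid_min)
{
	assert(m != NULL);

	/* In "not ptraceable" mode only the ptrace check decides. */
	if (m->hide_pid == MODEL_HIDEPID_NOT_PTRACEABLE)
		return m->ptrace_allowed;

	/* The mount is less strict than what this file requires. */
	if (m->hide_pid < hide_pid_min)
		return true;

	/* Members of the gid= group bypass the restriction. */
	if (m->caller_in_gid)
		return true;

	/* Otherwise fall back to the ptrace access check. */
	return m->ptrace_allowed;
}
```

Note that in this model membership of the gid= group does not help when the
mount uses the "not ptraceable" mode, because that branch returns before the
group check is reached.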
2023-01-13 14:49:22 +03:00
static int proc_pid_permission(struct mnt_idmap *idmap,
2021-01-21 16:19:43 +03:00
			       struct inode *inode, int mask)
2012-01-11 03:11:31 +04:00
{
2020-04-19 17:10:52 +03:00
	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.
The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:
hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.
hidepid=1 means users may not access any /proc/<pid>/ directories but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking is done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.
hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides the process's euid
and egid. It complicates an intruder's task of gathering info about running
processes: whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.
gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting
nonroot user in sudoers file or something. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystroke
timings:
http://www.openwall.com/lists/oss-security/2011/11/05/3
hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on the assumption that leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user who started the setuid program may read the
counters.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:31 +04:00
	struct task_struct *task;
	bool has_perms;
	task = get_proc_task(inode);
2012-01-13 05:17:08 +04:00
	if (!task)
		return -ESRCH;
2020-04-19 17:10:52 +03:00
	has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
2012-01-11 03:11:31 +04:00
	put_task_struct(task);
	if (!has_perms) {
2020-04-19 17:10:52 +03:00
		if (fs_info->hide_pid == HIDEPID_INVISIBLE) {
2012-01-11 03:11:31 +04:00
			/*
			 * Let's make getdents(), stat(), and open()
			 * consistent with each other.  If a process
			 * may not stat() a file, it shouldn't be seen
			 * in procfs at all.
			 */
			return -ENOENT;
		}
		return -EPERM;
	}
2023-01-13 14:49:22 +03:00
	return generic_permission(&nop_mnt_idmap, inode, mask);
2012-01-11 03:11:31 +04:00
}
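The error mapping at the end of the permission check above can be modeled outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not the kernel code: the enum and the function model_pid_permission() are hypothetical names, and the has_perms argument simply stands in for whatever has_pid_permissions() returned. It only illustrates that a failed check becomes -ENOENT on a mount in invisible mode and -EPERM otherwise.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace mirror of the hidepid levels 0, 1 and 2. */
enum model_hidepid {
	MODEL_HIDEPID_OFF = 0,
	MODEL_HIDEPID_NO_ACCESS = 1,
	MODEL_HIDEPID_INVISIBLE = 2,
};

/*
 * Model of the decision only. has_perms is an input here; in the kernel
 * it comes from has_pid_permissions(), which this sketch does not model.
 */
static int model_pid_permission(enum model_hidepid hide_pid, bool has_perms)
{
	if (!has_perms) {
		/* Invisible mode pretends the directory does not exist. */
		if (hide_pid == MODEL_HIDEPID_INVISIBLE)
			return -ENOENT;
		return -EPERM;
	}
	/* The kernel continues with generic_permission() at this point. */
	return 0;
}
```

Returning -ENOENT rather than -EPERM in invisible mode is what keeps open() and stat() consistent with a directory listing that omits the entry.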
2007-02-12 11:55:40 +03:00
static const struct inode_operations proc_def_inode_operations = {
2006-07-15 23:26:45 +04:00
	.setattr = proc_setattr,
};
2008-02-08 15:18:30 +03:00
static int proc_single_show(struct seq_file *m, void *v)
{
	struct inode *inode = m->private;
2020-05-18 21:07:38 +03:00
	struct pid_namespace *ns = proc_pid_ns(inode->i_sb);
2018-05-16 08:19:01 +03:00
	struct pid *pid = proc_pid(inode);
2008-02-08 15:18:30 +03:00
	struct task_struct *task;
	int ret;
	task = get_pid_task(pid, PIDTYPE_PID);
	if (!task)
		return -ESRCH;
	ret = PROC_I(inode)->op.proc_show(m, ns, pid, task);
	put_task_struct(task);
	return ret;
}
static int proc_single_open(struct inode *inode, struct file *filp)
{
2011-01-13 04:00:34 +03:00
	return single_open(filp, proc_single_show, inode);
2008-02-08 15:18:30 +03:00
}
static const struct file_operations proc_single_file_operations = {
	.open = proc_single_open,
	.read = seq_read,
	.llseek = seq_lseek,
	.release = single_release,
};
2014-10-10 02:25:24 +04:00
struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode)
2005-04-17 02:20:36 +04:00
{
2014-10-10 02:25:24 +04:00
	struct task_struct *task = get_proc_task(inode);
	struct mm_struct *mm = ERR_PTR(-ESRCH);
2012-01-18 03:21:19 +04:00
2014-10-10 02:25:24 +04:00
	if (task) {
ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 02:00:04 +03:00
		mm = mm_access(task, mode | PTRACE_MODE_FSCREDS);
2014-10-10 02:25:24 +04:00
		put_task_struct(task);
2012-01-18 03:21:19 +04:00
2014-10-10 02:25:24 +04:00
		if (!IS_ERR_OR_NULL(mm)) {
			/* ensure this mm_struct can't be freed */
2017-02-28 01:30:07 +03:00
			mmgrab(mm);
2014-10-10 02:25:24 +04:00
			/* but do not pin its memory */
			mmput(mm);
		}
	}
	return mm;
}
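The mmgrab()/mmput() pair in proc_mem_open() keeps the mm_struct itself alive without pinning the address space behind it. The sketch below is a simplified, single-threaded model of that two-counter scheme, assuming plain integers in place of the kernel's atomic counters. The struct and every model_ function are hypothetical names introduced for illustration; the only kernel behaviour assumed is that all "users" references together hold one "count" reference.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two reference counts on an mm_struct. */
struct mm_model {
	int users;		/* like mm_users: pins the address space */
	int count;		/* like mm_count: pins the struct itself */
	bool memory_alive;	/* address space not yet torn down */
	bool struct_alive;	/* struct not yet freed */
};

/* Take a reference on the struct only. */
static void model_mmgrab(struct mm_model *mm)
{
	mm->count++;
}

/* Drop a struct reference; the last one frees the struct. */
static void model_mmdrop(struct mm_model *mm)
{
	if (--mm->count == 0)
		mm->struct_alive = false;
}

/* Drop a user reference; the last one tears down the memory. */
static void model_mmput(struct mm_model *mm)
{
	if (--mm->users == 0) {
		mm->memory_alive = false;
		model_mmdrop(mm);	/* users jointly held one count ref */
	}
}

/* Re-acquire a user reference only if the memory is still alive. */
static bool model_mmget_not_zero(struct mm_model *mm)
{
	if (mm->users == 0)
		return false;
	mm->users++;
	return true;
}
```

In this model an open file holds only a count reference, so a later read has to call model_mmget_not_zero() first and give up when the target's memory is already gone. This is the same shape as the mmget_not_zero() check in mem_rw().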
static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
{
	struct mm_struct *mm = proc_mem_open(inode, mode);
2012-01-18 03:21:19 +04:00
	if (IS_ERR(mm))
		return PTR_ERR(mm);
	file->private_data = mm;
2005-04-17 02:20:36 +04:00
	return 0;
}
2012-06-01 03:26:17 +04:00
static int mem_open(struct inode *inode, struct file *file)
{
2012-07-31 01:42:28 +04:00
	int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
	/* OK to pass negative loff_t, we can catch out-of-range */
	file->f_mode |= FMODE_UNSIGNED_OFFSET;
	return ret;
2012-06-01 03:26:17 +04:00
}
2012-01-31 20:14:54 +04:00
static ssize_t mem_rw(struct file *file, char __user *buf,
			size_t count, loff_t *ppos, int write)
2005-04-17 02:20:36 +04:00
{
2012-01-18 03:21:19 +04:00
	struct mm_struct *mm = file->private_data;
2012-01-31 20:14:54 +04:00
	unsigned long addr = *ppos;
	ssize_t copied;
2005-04-17 02:20:36 +04:00
	char *page;
2016-10-25 05:00:44 +03:00
	unsigned int flags;
2005-04-17 02:20:36 +04:00
2012-01-18 03:21:19 +04:00
	if (!mm)
		return 0;
2006-06-26 11:25:55 +04:00
2017-09-14 02:28:29 +03:00
	page = (char *)__get_free_page(GFP_KERNEL);
2011-05-27 03:25:52 +04:00
	if (!page)
2012-01-18 03:21:19 +04:00
		return -ENOMEM;
2005-04-17 02:20:36 +04:00
2006-09-29 13:01:02 +04:00
	copied = 0;
2017-02-28 01:30:13 +03:00
	if (!mmget_not_zero(mm))
2012-01-31 20:15:11 +04:00
		goto free;
"Yes, people use FOLL_FORCE ;)"
This effectively reverts commit 8ee74a91ac30 ("proc: try to remove use
of FOLL_FORCE entirely")
It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
case, and we're talking not just debuggers. Talking to the affected people, the use-cases are:
Keno Fischer:
"We used these semantics as a hardening mechanism in the julia JIT. By
opening /proc/self/mem and using these semantics, we could avoid
needing RWX pages, or a dual mapping approach. We do have fallbacks to
these other methods (though getting EIO here actually causes an assert
in released versions - we'll updated that to make sure to take the
fall back in that case).
Nevertheless the /proc/self/mem approach was our favored approach
because it a) Required an attacker to be able to execute syscalls
which is a taller order than getting memory write and b) didn't double
the virtual address space requirements (as a dual mapping approach
would).
I think in general this feature is very useful for anybody who needs
to precisely control the execution of some other process. Various
debuggers (gdb/lldb/rr) certainly fall into that category, but there's
another class of such processes (wine, various emulators) which may
want to do that kind of thing.
Now, I suspect most of these will have the other process under ptrace
control, so maybe allowing (same_mm || ptraced) would be ok, but at
least for the sandbox/remote-jit use case, it would be perfectly
reasonable to not have the jit server be a ptracer"
Robert O'Callahan:
"We write to readonly code and data mappings via /proc/.../mem in lots
of different situations, particularly when we're adjusting program
state during replay to match the recorded execution.
Like Julia, we can add workarounds, but they could be expensive."
so not only do people use FOLL_FORCE for both reads and writes, but they
use it for both the local mm and remote mm.
With these comments in mind, we likely also cannot add the "are we
actively ptracing" check either, so this keeps the new code organization
and does not do a real revert that would add back the original comment
about "Maybe we should limit FOLL_FORCE to actual ptrace users?"
Reported-by: Keno Fischer <keno@juliacomputing.com>
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-30 22:38:59 +03:00
	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
2016-10-13 03:20:19 +03:00
2005-04-17 02:20:36 +04:00
	while (count > 0) {
2021-07-01 04:54:38 +03:00
		size_t this_len = min_t(size_t, count, PAGE_SIZE);
2005-04-17 02:20:36 +04:00
2012-01-31 20:14:54 +04:00
		if (write && copy_from_user(page, buf, this_len)) {
2005-04-17 02:20:36 +04:00
			copied = -EFAULT;
			break;
		}
2012-01-31 20:14:54 +04:00
2016-10-13 03:20:19 +03:00
		this_len = access_remote_vm(mm, addr, page, this_len, flags);
2012-01-31 20:14:54 +04:00
		if (!this_len) {
2005-04-17 02:20:36 +04:00
			if (!copied)
				copied = -EIO;
			break;
		}
2012-01-31 20:14:54 +04:00
		if (!write && copy_to_user(buf, page, this_len)) {
			copied = -EFAULT;
			break;
		}
		buf += this_len;
		addr += this_len;
		copied += this_len;
		count -= this_len;
2005-04-17 02:20:36 +04:00
	}
2012-01-31 20:14:54 +04:00
	*ppos = addr;
2011-05-27 03:25:52 +04:00
2012-01-31 20:15:11 +04:00
	mmput(mm);
free:
2011-05-27 03:25:52 +04:00
	free_page((unsigned long) page);
2005-04-17 02:20:36 +04:00
	return copied;
}
2012-01-31 20:14:54 +04:00
static ssize_t mem_read(struct file *file, char __user *buf,
			size_t count, loff_t *ppos)
{
	return mem_rw(file, buf, count, ppos, 0);
}
static ssize_t mem_write(struct file *file, const char __user *buf,
			size_t count, loff_t *ppos)
{
	return mem_rw(file, (char __user *)buf, count, ppos, 1);
}
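From userspace, the read path above is reached by reading /proc/<pid>/mem at a file offset equal to the virtual address of interest. The following is a minimal sketch of that idea for the calling process itself. The helper name read_self_mem() is hypothetical, and the sketch assumes a Linux system with procfs mounted at /proc and a 64-bit off_t.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes starting at addr in the calling
 * process by reading /proc/self/mem at the matching file offset.
 * Returns the byte count reported by pread(), or -1 if the file cannot
 * be opened or read (for example when procfs is not mounted).
 */
static ssize_t read_self_mem(const void *addr, void *out, size_t len)
{
	int fd = open("/proc/self/mem", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	/* The file offset is interpreted as a virtual address. */
	n = pread(fd, out, len, (off_t)(uintptr_t)addr);
	close(fd);
	return n;
}
```

A caller would pass the address of one of its own buffers and compare the bytes that come back. A short read is possible, so real code should loop until len bytes have arrived or an error is returned.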
2008-02-05 09:29:04 +03:00
loff_t mem_lseek(struct file *file, loff_t offset, int orig)
2005-04-17 02:20:36 +04:00
{
	switch (orig) {
	case 0:
		file->f_pos = offset;
		break;
	case 1:
		file->f_pos += offset;
		break;
	default:
		return -EINVAL;
	}
	force_successful_syscall_return();
	return file->f_pos;
}
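The seek rule above accepts only an absolute position (0, which is SEEK_SET) and a relative one (1, which is SEEK_CUR); anything else, including a seek relative to the end, is rejected. A small userspace model of that rule, assuming a plain long long in place of file->f_pos; the function name model_mem_lseek() is hypothetical.

```c
#include <errno.h>

/*
 * Model of the whence handling only: 0 sets the position, 1 adjusts it,
 * and every other value is rejected without touching the position.
 */
static long long model_mem_lseek(long long *pos, long long offset, int whence)
{
	switch (whence) {
	case 0:			/* SEEK_SET */
		*pos = offset;
		break;
	case 1:			/* SEEK_CUR */
		*pos += offset;
		break;
	default:		/* SEEK_END and others are not supported */
		return -EINVAL;
	}
	return *pos;
}
```

There is no meaningful "end" of an address space to seek from, which is one reading of why only these two modes are handled.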
2012-01-18 03:21:19 +04:00
static int mem_release(struct inode *inode, struct file *file)
{
	struct mm_struct *mm = file->private_data;
2012-01-31 20:14:38 +04:00
	if (mm)
2012-01-31 20:15:11 +04:00
		mmdrop(mm);
2012-01-18 03:21:19 +04:00
	return 0;
}
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_mem_operations = {
2005-04-17 02:20:36 +04:00
	.llseek = mem_lseek,
	.read = mem_read,
	.write = mem_write,
	.open = mem_open,
2012-01-18 03:21:19 +04:00
	.release = mem_release,
2005-04-17 02:20:36 +04:00
};
2012-06-01 03:26:17 +04:00
static int environ_open(struct inode *inode, struct file *file)
{
	return __mem_open(inode, file, PTRACE_MODE_READ);
}
2007-10-17 10:30:17 +04:00
static ssize_t environ_read(struct file *file, char __user *buf,
			size_t count, loff_t *ppos)
{
	char *page;
	unsigned long src = *ppos;
2012-06-01 03:26:17 +04:00
	int ret = 0;
	struct mm_struct *mm = file->private_data;
2016-01-21 02:01:05 +03:00
	unsigned long env_start, env_end;
2007-10-17 10:30:17 +04:00
2016-05-06 02:22:26 +03:00
	/* Ensure the process spawned far enough to have an environment. */
	if (!mm || !mm->env_end)
2012-06-01 03:26:17 +04:00
		return 0;
2007-10-17 10:30:17 +04:00
2017-09-14 02:28:29 +03:00
	page = (char *)__get_free_page(GFP_KERNEL);
2007-10-17 10:30:17 +04:00
	if (!page)
2012-06-01 03:26:17 +04:00
		return -ENOMEM;
2007-10-17 10:30:17 +04:00
2011-02-16 06:26:01 +03:00
	ret = 0;
2017-02-28 01:30:13 +03:00
	if (!mmget_not_zero(mm))
2012-06-01 03:26:17 +04:00
		goto free;
2016-01-21 02:01:05 +03:00
mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
mmap_sem is on the hot path of kernel, and it is very contended, but it is
abused too. It is used to protect arg_start|end and env_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
are not related to VM; they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write mmap_sem with read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs to acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
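The snapshot pattern this commit describes (copy a handful of related fields out under a short lock, then do the slow work without it) can be sketched in userspace. This is only an illustration, not the kernel code: `struct region_info`, `struct region_snapshot`, `region_set()` and `region_snapshot_get()` are hypothetical names, and a pthread mutex stands in for the kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>

/* The four boundaries that must be read as one consistent set. */
struct region_snapshot {
	unsigned long arg_start, arg_end;
	unsigned long env_start, env_end;
};

/* Hypothetical stand-in for the relevant part of mm_struct. */
struct region_info {
	pthread_mutex_t lock;		/* plays the role of mm->arg_lock */
	struct region_snapshot cur;
};

/* Writer side: update all four values in one critical section. */
static void region_set(struct region_info *ri, const struct region_snapshot *s)
{
	assert(ri && s);
	pthread_mutex_lock(&ri->lock);
	ri->cur = *s;
	pthread_mutex_unlock(&ri->lock);
}

/*
 * Reader side: copy the values out under the lock and return the copy.
 * The caller then works on the copy, so the lock is never held across
 * slow operations such as copying data to another address space.
 */
static struct region_snapshot region_snapshot_get(struct region_info *ri)
{
	struct region_snapshot s;

	assert(ri);
	pthread_mutex_lock(&ri->lock);
	s = ri->cur;
	pthread_mutex_unlock(&ri->lock);
	return s;
}
```

The lock only guarantees that the four values belong together; as the message notes, they may still be arbitrary, so a reader has to bounds-check them before use.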
2018-06-08 03:05:28 +03:00
	spin_lock(&mm->arg_lock);
2016-01-21 02:01:05 +03:00
	env_start = mm->env_start;
	env_end = mm->env_end;
2018-06-08 03:05:28 +03:00
	spin_unlock(&mm->arg_lock);
2016-01-21 02:01:05 +03:00
2007-10-17 10:30:17 +04:00
	while (count > 0) {
2012-07-31 01:42:26 +04:00
		size_t this_len, max_len;
		int retval;
2007-10-17 10:30:17 +04:00
2016-01-21 02:01:05 +03:00
		if (src >= (env_end - env_start))
2007-10-17 10:30:17 +04:00
			break;
2016-01-21 02:01:05 +03:00
		this_len = env_end - (env_start + src);
2012-07-31 01:42:26 +04:00
		max_len = min_t(size_t, PAGE_SIZE, count);
		this_len = min(max_len, this_len);
2007-10-17 10:30:17 +04:00
2018-05-11 09:11:44 +03:00
		retval = access_remote_vm(mm, (env_start + src), page, this_len, FOLL_ANON);
2007-10-17 10:30:17 +04:00
		if (retval <= 0) {
			ret = retval;
			break;
		}
		if (copy_to_user(buf, page, retval)) {
			ret = -EFAULT;
			break;
		}
		ret += retval;
		src += retval;
		buf += retval;
		count -= retval;
	}
	*ppos = src;
	mmput(mm);
2012-06-01 03:26:17 +04:00
free:
2007-10-17 10:30:17 +04:00
	free_page((unsigned long) page);
	return ret;
}
static const struct file_operations proc_environ_operations = {
2012-06-01 03:26:17 +04:00
	.open = environ_open,
2007-10-17 10:30:17 +04:00
	.read = environ_read,
2010-03-18 01:06:02 +03:00
	.llseek = generic_file_llseek,
2012-06-01 03:26:17 +04:00
	.release = mem_release,
2007-10-17 10:30:17 +04:00
};
2016-10-06 01:43:43 +03:00
static int auxv_open(struct inode *inode, struct file *file)
{
	return __mem_open(inode, file, PTRACE_MODE_READ_FSCREDS);
}
static ssize_t auxv_read(struct file *file, char __user *buf,
			 size_t count, loff_t *ppos)
{
	struct mm_struct *mm = file->private_data;
	unsigned int nwords = 0;
2016-10-28 03:46:50 +03:00
	if (!mm)
		return 0;
2016-10-06 01:43:43 +03:00
	do {
		nwords += 2;
	} while (mm->saved_auxv[nwords - 2] != 0); /* AT_NULL */
	return simple_read_from_buffer(buf, count, ppos, mm->saved_auxv,
				       nwords * sizeof(mm->saved_auxv[0]));
}
static const struct file_operations proc_auxv_operations = {
	.open = auxv_open,
	.read = auxv_read,
	.llseek = generic_file_llseek,
	.release = mem_release,
};
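The do/while loop in auxv_read() sizes the auxiliary vector by walking (type, value) pairs until it has counted the pair whose type is 0 (AT_NULL). A standalone sketch of the same counting rule follows; `auxv_word_count()` and `auxv_word_count_selfcheck()` are hypothetical helpers written for illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the words in an auxv-style array of (type, value) pairs.
 * The terminating pair, whose type is 0 (AT_NULL), is included in
 * the count, so the result is always even and at least 2.
 */
static size_t auxv_word_count(const unsigned long *auxv)
{
	size_t nwords = 0;

	assert(auxv != NULL);
	do {
		nwords += 2;
	} while (auxv[nwords - 2] != 0);	/* stop once AT_NULL was counted */
	return nwords;
}

/* Usage: two real entries plus the terminator give six words. */
static void auxv_word_count_selfcheck(void)
{
	const unsigned long sample[] = { 6, 4096, 17, 100, 0, 0 };

	assert(auxv_word_count(sample) == 6);
}
```

Counting the terminator matters because a reader of the file relies on the AT_NULL pair to know where the vector ends.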
2012-11-13 05:53:04 +04:00
static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
			    loff_t *ppos)
{
2013-01-24 02:07:38 +04:00
	struct task_struct *task = get_proc_task(file_inode(file));
2012-11-13 05:53:04 +04:00
	char buffer[PROC_NUMBUF];
	int oom_adj = OOM_ADJUST_MIN;
	size_t len;
	if (!task)
		return -ESRCH;
2016-07-29 01:44:37 +03:00
	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
		oom_adj = OOM_ADJUST_MAX;
	else
		oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
			  OOM_SCORE_ADJ_MAX;
2012-11-13 05:53:04 +04:00
	put_task_struct(task);
2020-11-02 04:07:56 +03:00
	if (oom_adj > OOM_ADJUST_MAX)
		oom_adj = OOM_ADJUST_MAX;
2012-11-13 05:53:04 +04:00
	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj);
	return simple_read_from_buffer(buf, count, ppos, buffer, len);
}
2016-07-29 01:44:40 +03:00
static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
{
2016-07-29 01:44:43 +03:00
	struct mm_struct *mm = NULL;
2016-07-29 01:44:40 +03:00
	struct task_struct *task;
	int err = 0;
	task = get_proc_task(file_inode(file));
	if (!task)
		return -ESRCH;
	mutex_lock(&oom_adj_mutex);
	if (legacy) {
		if (oom_adj < task->signal->oom_score_adj &&
				!capable(CAP_SYS_RESOURCE)) {
			err = -EACCES;
			goto err_unlock;
		}
		/*
		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
		 * /proc/pid/oom_score_adj instead.
		 */
		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
			  current->comm, task_pid_nr(current), task_pid_nr(task),
			  task_pid_nr(task));
	} else {
		if ((short)oom_adj < task->signal->oom_score_adj_min &&
				!capable(CAP_SYS_RESOURCE)) {
			err = -EACCES;
			goto err_unlock;
		}
	}
2016-07-29 01:44:43 +03:00
	/*
	 * Make sure we will check other processes sharing the mm if this is
	 * not vfork which wants its own oom_score_adj.
	 * pin the mm so it doesn't go away and get reused after task_unlock
	 */
	if (!task->vfork_done) {
		struct task_struct *p = find_lock_task_mm(task);
		if (p) {
mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary
Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm. This is done for any task with more than one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.
Android updates oom_score_adj whenever a task changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important. Such operation can happen frequently. We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression. Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us. Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.
Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes. To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex. Its scope is changed to global.
The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork(). To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified. Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare. Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.
With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.
[surenb@google.com: v3]
Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
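The flag test described in this message, (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK), reduces to a single masked comparison. The sketch below is a userspace illustration under assumptions: the SK_CLONE_* macros are local copies of the Linux UAPI clone flag values, and `clone_shares_mm_across_processes()` is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the clone flag values from the Linux UAPI headers. */
#define SK_CLONE_VM	0x00000100UL
#define SK_CLONE_VFORK	0x00004000UL
#define SK_CLONE_THREAD	0x00010000UL

/*
 * True only when the child shares the parent's mm but is neither a
 * thread of the same process nor a vfork() child, i.e. the case in
 * which the mm would be marked MMF_MULTIPROCESS.
 */
static bool clone_shares_mm_across_processes(unsigned long clone_flags)
{
	const unsigned long mask = SK_CLONE_VM | SK_CLONE_VFORK | SK_CLONE_THREAD;

	/* Of the three bits, CLONE_VM must be set and the others clear. */
	return (clone_flags & mask) == SK_CLONE_VM;
}

/* Usage: only the bare CLONE_VM case qualifies. */
static void clone_flags_selfcheck(void)
{
	assert(clone_shares_mm_across_processes(SK_CLONE_VM));
	assert(!clone_shares_mm_across_processes(SK_CLONE_VM | SK_CLONE_THREAD));
	assert(!clone_shares_mm_across_processes(SK_CLONE_VM | SK_CLONE_VFORK));
	assert(!clone_shares_mm_across_processes(0));
}
```

Masking first means unrelated clone flags do not affect the result.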
2020-10-14 02:58:35 +03:00
			if (test_bit(MMF_MULTIPROCESS, &p->mm->flags)) {
2016-07-29 01:44:43 +03:00
				mm = p->mm;
2017-02-28 01:30:07 +03:00
				mmgrab(mm);
2016-07-29 01:44:43 +03:00
			}
			task_unlock(p);
		}
	}
2016-07-29 01:44:40 +03:00
	task->signal->oom_score_adj = oom_adj;
	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
		task->signal->oom_score_adj_min = (short)oom_adj;
	trace_oom_score_adj_update(task);
2016-07-29 01:44:43 +03:00
	if (mm) {
		struct task_struct *p;
		rcu_read_lock();
		for_each_process(p) {
			if (same_thread_group(task, p))
				continue;
			/* do not touch kernel threads or the global init */
			if (p->flags & PF_KTHREAD || is_global_init(p))
				continue;
			task_lock(p);
			if (!p->vfork_done && process_shares_mm(p, mm)) {
				p->signal->oom_score_adj = oom_adj;
				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
					p->signal->oom_score_adj_min = (short)oom_adj;
			}
			task_unlock(p);
		}
		rcu_read_unlock();
		mmdrop(mm);
	}
2016-07-29 01:44:40 +03:00
err_unlock:
	mutex_unlock(&oom_adj_mutex);
	put_task_struct(task);
	return err;
}
2016-07-29 01:44:37 +03:00
2015-11-06 05:50:32 +03:00
/*
 * /proc/pid/oom_adj exists solely for backwards compatibility with previous
 * kernels. The effective policy is defined by oom_score_adj, which has a
 * different scale: oom_adj grew exponentially and oom_score_adj grows linearly.
 * Values written to oom_adj are simply mapped linearly to oom_score_adj.
 * Processes that become oom disabled via oom_adj will still be oom disabled
 * with this implementation.
 *
 * oom_adj cannot be removed since existing userspace binaries use it.
 */
2012-11-13 05:53:04 +04:00
static ssize_t oom_adj_write(struct file *file, const char __user *buf,
			     size_t count, loff_t *ppos)
{
	char buffer[PROC_NUMBUF];
	int oom_adj;
	int err;
	memset(buffer, 0, sizeof(buffer));
	if (count > sizeof(buffer) - 1)
		count = sizeof(buffer) - 1;
	if (copy_from_user(buffer, buf, count)) {
		err = -EFAULT;
		goto out;
	}
	err = kstrtoint(strstrip(buffer), 0, &oom_adj);
	if (err)
		goto out;
	if ((oom_adj < OOM_ADJUST_MIN || oom_adj > OOM_ADJUST_MAX) &&
	     oom_adj != OOM_DISABLE) {
		err = -EINVAL;
		goto out;
	}
	/*
	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
	 * value is always attainable.
	 */
	if (oom_adj == OOM_ADJUST_MAX)
		oom_adj = OOM_SCORE_ADJ_MAX;
	else
		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
2016-07-29 01:44:40 +03:00
	err = __set_oom_adj(file, oom_adj, true);
2012-11-13 05:53:04 +04:00
out:
	return err < 0 ? err : count;
}
static const struct file_operations proc_oom_adj_operations = {
	.read = oom_adj_read,
	.write = oom_adj_write,
	.llseek = generic_file_llseek,
};
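The arithmetic in oom_adj_write() and oom_adj_read() is easier to follow with concrete numbers. The sketch below repeats the two conversions as plain userspace functions. The SK_* macros are local copies of the values in the kernel's include/uapi/linux/oom.h, and the function names are made up for this example.

```c
#include <assert.h>

/* Local copies of the values in include/uapi/linux/oom.h. */
#define SK_OOM_DISABLE		(-17)
#define SK_OOM_ADJUST_MIN	(-16)
#define SK_OOM_ADJUST_MAX	15
#define SK_OOM_SCORE_ADJ_MAX	1000

/* Write path: legacy oom_adj value -> oom_score_adj units. */
static int oom_adj_to_score_adj(int oom_adj)
{
	assert(oom_adj == SK_OOM_DISABLE ||
	       (oom_adj >= SK_OOM_ADJUST_MIN && oom_adj <= SK_OOM_ADJUST_MAX));

	/* Special case so that the maximum score is always attainable. */
	if (oom_adj == SK_OOM_ADJUST_MAX)
		return SK_OOM_SCORE_ADJ_MAX;
	return (oom_adj * SK_OOM_SCORE_ADJ_MAX) / -SK_OOM_DISABLE;
}

/* Read path: oom_score_adj -> legacy oom_adj units. */
static int score_adj_to_oom_adj(int score_adj)
{
	int oom_adj;

	if (score_adj == SK_OOM_SCORE_ADJ_MAX)
		return SK_OOM_ADJUST_MAX;
	oom_adj = (score_adj * -SK_OOM_DISABLE) / SK_OOM_SCORE_ADJ_MAX;
	/* 999 would scale to 16, which is outside the legacy range. */
	if (oom_adj > SK_OOM_ADJUST_MAX)
		oom_adj = SK_OOM_ADJUST_MAX;
	return oom_adj;
}
```

Because both directions use truncating integer division, a round trip is lossy: writing 8 stores 470 (8 * 1000 / 17), and reading 470 back yields 7 (470 * 17 / 1000).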
oom: badness heuristic rewrite
This is a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
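As a worked illustration of the scoring described above, the following sketch computes a 0..1000 score from page counts. It is written from the prose of this message, not taken from the kernel: `badness_sketch()` and its parameters are hypothetical, and treating the 3% root allowance as a 30-point discount is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Score a task on a 0..1000 scale from its share of allowable memory.
 * rss_pages and swap_pages are what the task uses; allowable_pages is
 * the memory available in the current oom context.
 */
static long badness_sketch(unsigned long rss_pages, unsigned long swap_pages,
			   unsigned long allowable_pages, bool is_root,
			   int oom_score_adj)
{
	long points;

	assert(allowable_pages > 0);
	assert(oom_score_adj >= -1000 && oom_score_adj <= 1000);

	/* Proportion of allowable memory in use, scaled to 0..1000. */
	points = (long)((rss_pages + swap_pages) * 1000 / allowable_pages);

	/* Assumed form of the root bonus: 3% of the 1000-point scale. */
	if (is_root)
		points -= 30;

	/* The userspace tunable is added directly to the score. */
	points += oom_score_adj;

	if (points < 0)
		points = 0;
	if (points > 1000)
		points = 1000;
	return points;
}
```

For a task using 500 of 1000 allowable pages the sketch gives 500; the same task scores 470 if owned by root, and 0 with an oom_score_adj of -500.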
2010-08-10 04:19:46 +04:00
static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
					size_t count, loff_t *ppos)
{
2013-01-24 02:07:38 +04:00
	struct task_struct *task = get_proc_task(file_inode(file));
2010-08-10 04:19:46 +04:00
	char buffer[PROC_NUMBUF];
2012-12-12 04:02:54 +04:00
	short oom_score_adj = OOM_SCORE_ADJ_MIN;
2010-08-10 04:19:46 +04:00
	size_t len;
	if (!task)
		return -ESRCH;
2016-07-29 01:44:37 +03:00
	oom_score_adj = task->signal->oom_score_adj;
2010-08-10 04:19:46 +04:00
	put_task_struct(task);
2012-12-12 04:02:54 +04:00
	len = snprintf(buffer, sizeof(buffer), "%hd\n", oom_score_adj);
2010-08-10 04:19:46 +04:00
return simple_read_from_buffer ( buf , count , ppos , buffer , len ) ;
}
static ssize_t oom_score_adj_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * ppos )
{
char buffer [ PROC_NUMBUF ] ;
2011-05-27 03:25:50 +04:00
int oom_score_adj ;
oom: badness heuristic rewrite
This is a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space are used. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
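
The scoring described above can be sketched roughly as follows. This is an illustrative userspace model, not the kernel's implementation: the helper name `badness_sketch`, its parameters, and the clamping to the 0..1000 range are assumptions made for the example. The 3% root bonus is modeled as 30 points on the 1000-point scale.

```c
#include <stdbool.h>

#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Hypothetical helper: all memory quantities are in pages.
 * allowable_pages stands for the "allowable" memory of the oom context
 * (whole system, mempolicy nodes, cpuset mems, or memory controller limit).
 */
static long badness_sketch(unsigned long long rss_pages,
                           unsigned long long swap_pages,
                           unsigned long long allowable_pages,
                           bool is_root, int oom_score_adj)
{
    long points;

    if (allowable_pages == 0)
        return 0;
    /* Assumption: the minimum adjustment exempts the task entirely. */
    if (oom_score_adj == OOM_SCORE_ADJ_MIN)
        return 0;

    /* Proportion of allowable memory in use, on a 0..1000 scale. */
    points = (long)((rss_pages + swap_pages) * 1000 / allowable_pages);

    /* Root tasks get 3% extra memory, i.e. 30 points of discount. */
    if (is_root)
        points -= 30;

    /* The userspace adjustment is added directly to the score. */
    points += oom_score_adj;

    /* Assumption: keep the result on the documented 0..1000 scale. */
    if (points < 0)
        points = 0;
    if (points > OOM_SCORE_ADJ_MAX)
        points = OOM_SCORE_ADJ_MAX;
    return points;
}
```

Under this model, a task using half of the allowable memory scores 500, and an oom_score_adj of -500 discounts that score to 0.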
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
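
A minimal sketch of the linear rescaling between the two tunables, assuming the legacy /proc/pid/oom_adj range of -17 (disable) to +15. Those two bounds and the helper names are assumptions for illustration and are not stated in this message; integer division means a round trip can lose precision, which is the "different granularity" mentioned above.

```c
#define OOM_DISABLE       (-17)  /* assumed legacy oom_adj minimum */
#define OOM_ADJUST_MAX    15     /* assumed legacy oom_adj maximum */
#define OOM_SCORE_ADJ_MAX 1000

/* Hypothetical helper: map a legacy oom_adj value onto the new scale. */
static int oom_adj_to_score_adj(int oom_adj)
{
    /* Pin the legacy maximum to the new maximum so "always kill" survives. */
    if (oom_adj == OOM_ADJUST_MAX)
        return OOM_SCORE_ADJ_MAX;
    return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

/* Hypothetical helper: map an oom_score_adj value back to the legacy scale. */
static int score_adj_to_oom_adj(int oom_score_adj)
{
    if (oom_score_adj == OOM_SCORE_ADJ_MAX)
        return OOM_ADJUST_MAX;
    return (oom_score_adj * -OOM_DISABLE) / OOM_SCORE_ADJ_MAX;
}
```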
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10 04:19:46 +04:00
int err ;
memset ( buffer , 0 , sizeof ( buffer ) ) ;
if ( count > sizeof ( buffer ) - 1 )
count = sizeof ( buffer ) - 1 ;
2010-10-27 01:21:25 +04:00
if ( copy_from_user ( buffer , buf , count ) ) {
err = - EFAULT ;
goto out ;
}
2010-08-10 04:19:46 +04:00
2011-05-27 03:25:50 +04:00
err = kstrtoint ( strstrip ( buffer ) , 0 , & oom_score_adj ) ;
2010-08-10 04:19:46 +04:00
if ( err )
2010-10-27 01:21:25 +04:00
goto out ;
2010-08-10 04:19:46 +04:00
if ( oom_score_adj < OOM_SCORE_ADJ_MIN ||
2010-10-27 01:21:25 +04:00
oom_score_adj > OOM_SCORE_ADJ_MAX ) {
err = - EINVAL ;
goto out ;
}
2010-08-10 04:19:46 +04:00
2016-07-29 01:44:40 +03:00
err = __set_oom_adj ( file , oom_score_adj , false ) ;
2010-10-27 01:21:25 +04:00
out :
return err < 0 ? err : count ;
2010-08-10 04:19:46 +04:00
}
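
The parse-and-validate steps of the write handler above can be approximated in userspace as below. This is a sketch only: `parse_oom_score_adj` is a hypothetical name, `strtol` stands in for the kernel's `kstrtoint`, and every failure is collapsed to `-EINVAL` for brevity.

```c
#include <errno.h>
#include <stdlib.h>

#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/* Hypothetical helper: returns 0 and stores the value, or -EINVAL. */
static int parse_oom_score_adj(const char *text, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 0);      /* base 0: decimal, hex, or octal */
    if (end == text || errno == ERANGE)
        return -EINVAL;

    /* Tolerate trailing whitespace such as the newline added by echo. */
    while (*end == ' ' || *end == '\t' || *end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    /* Same range check as the handler: reject values outside the scale. */
    if (value < OOM_SCORE_ADJ_MIN || value > OOM_SCORE_ADJ_MAX)
        return -EINVAL;

    *out = (int)value;
    return 0;
}
```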
static const struct file_operations proc_oom_score_adj_operations = {
. read = oom_score_adj_read ,
. write = oom_score_adj_write ,
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// neither read nor write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
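
As a simplified summary of the rule priority in the semantic patch, the selection could be modeled like this. The struct, enum, and function names are hypothetical, and the model ignores cases the patch cannot prove either way.

```c
#include <stdbool.h>

enum llseek_choice {
    KEEP_EXISTING,       /* .llseek already set, nothing is added */
    USE_NO_LLSEEK,       /* open calls nonseekable_open */
    USE_SEQ_LSEEK,       /* .read is seq_read */
    USE_DEFAULT_LLSEEK,  /* readdir present, or read/write touches f_pos */
    USE_NOOP_LLSEEK      /* file offset is provably ignored */
};

/* Hypothetical summary of what the patch detects for one file_operations. */
struct fops_traits {
    bool has_llseek;
    bool open_is_nonseekable;
    bool read_is_seq_read;
    bool has_readdir;
    bool touches_pos;    /* read or write accesses the offset */
};

static enum llseek_choice choose_llseek(const struct fops_traits *t)
{
    if (t->has_llseek)
        return KEEP_EXISTING;
    if (t->open_is_nonseekable)
        return USE_NO_LLSEEK;
    if (t->read_is_seq_read)
        return USE_SEQ_LSEEK;
    if (t->has_readdir || t->touches_pos)
        return USE_DEFAULT_LLSEEK;
    return USE_NOOP_LLSEEK;
}
```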
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-15 20:52:59 +04:00
. llseek = default_llseek ,
oom: badness heuristic rewrite
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
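One way to picture the linear rescaling between the two tunables is the sketch below. It assumes the historical oom_adj range of -17 (disable) to +15, which is not stated in this message, and the macro and function names are chosen for illustration; the exact rounding used by the kernel may differ.

```c
#include <assert.h>

/* Assumed ranges: oom_adj is -17..15, oom_score_adj is -1000..1000. */
#define SKETCH_OOM_DISABLE       (-17)
#define SKETCH_OOM_ADJUST_MAX    15
#define SKETCH_OOM_SCORE_ADJ_MAX 1000

/* Map a legacy oom_adj value onto the oom_score_adj scale. */
int oom_adj_to_score_adj(int oom_adj)
{
    assert(oom_adj >= SKETCH_OOM_DISABLE && oom_adj <= SKETCH_OOM_ADJUST_MAX);

    /* Pin the legacy maximum to the new maximum. */
    if (oom_adj == SKETCH_OOM_ADJUST_MAX)
        return SKETCH_OOM_SCORE_ADJ_MAX;
    return oom_adj * SKETCH_OOM_SCORE_ADJ_MAX / -SKETCH_OOM_DISABLE;
}

/* Map an oom_score_adj value back onto the coarser oom_adj scale. */
int score_adj_to_oom_adj(int score_adj)
{
    assert(score_adj >= -SKETCH_OOM_SCORE_ADJ_MAX &&
           score_adj <= SKETCH_OOM_SCORE_ADJ_MAX);

    if (score_adj == SKETCH_OOM_SCORE_ADJ_MAX)
        return SKETCH_OOM_ADJUST_MAX;
    return score_adj * -SKETCH_OOM_DISABLE / SKETCH_OOM_SCORE_ADJ_MAX;
}
```

The "different granularity" shows up as rounding: under this sketch an oom_adj of 8 maps to 470, and 470 maps back to 7 rather than 8.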
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10 04:19:46 +04:00
} ;
2019-01-23 01:06:39 +03:00
# ifdef CONFIG_AUDIT
2016-10-29 19:04:39 +03:00
# define TMPBUFLEN 11
2005-04-17 02:20:36 +04:00
static ssize_t proc_loginuid_read ( struct file * file , char __user * buf ,
size_t count , loff_t * ppos )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2006-06-26 11:25:55 +04:00
struct task_struct * task = get_proc_task ( inode ) ;
2005-04-17 02:20:36 +04:00
ssize_t length ;
char tmpbuf [ TMPBUFLEN ] ;
2006-06-26 11:25:55 +04:00
if ( ! task )
return - ESRCH ;
2005-04-17 02:20:36 +04:00
length = scnprintf ( tmpbuf , TMPBUFLEN , "%u" ,
2012-09-11 09:39:43 +04:00
from_kuid ( file->f_cred->user_ns ,
audit_get_loginuid ( task ) ) ) ;
2006-06-26 11:25:55 +04:00
put_task_struct ( task ) ;
2005-04-17 02:20:36 +04:00
return simple_read_from_buffer ( buf , count , ppos , tmpbuf , length ) ;
}
static ssize_t proc_loginuid_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * ppos )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2005-04-17 02:20:36 +04:00
uid_t loginuid ;
2012-09-11 09:39:43 +04:00
kuid_t kloginuid ;
2015-09-10 01:36:59 +03:00
int rv ;
2005-04-17 02:20:36 +04:00
2020-10-15 22:46:44 +03:00
/* Don't let kthreads write their own loginuid */
if ( current->flags & PF_KTHREAD )
return - EPERM ;
2010-02-23 04:04:52 +03:00
rcu_read_lock ( ) ;
if ( current != pid_task ( proc_pid ( inode ) , PIDTYPE_PID ) ) {
rcu_read_unlock ( ) ;
2005-04-17 02:20:36 +04:00
return - EPERM ;
2010-02-23 04:04:52 +03:00
}
rcu_read_unlock ( ) ;
2005-04-17 02:20:36 +04:00
if ( * ppos != 0 ) {
/* No partial writes. */
return - EINVAL ;
}
2015-09-10 01:36:59 +03:00
rv = kstrtou32_from_user ( buf , count , 10 , & loginuid ) ;
if ( rv < 0 )
return rv ;
2013-05-24 17:49:14 +04:00
/* is userspace trying to explicitly UNSET the loginuid? */
if ( loginuid == AUDIT_UID_UNSET ) {
kloginuid = INVALID_UID ;
} else {
kloginuid = make_kuid ( file->f_cred->user_ns , loginuid ) ;
2015-09-10 01:36:59 +03:00
if ( ! uid_valid ( kloginuid ) )
return - EINVAL ;
2012-09-11 09:39:43 +04:00
}
2015-09-10 01:36:59 +03:00
rv = audit_set_loginuid ( kloginuid ) ;
if ( rv < 0 )
return rv ;
return count ;
2005-04-17 02:20:36 +04:00
}
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_loginuid_operations = {
2005-04-17 02:20:36 +04:00
. read = proc_loginuid_read ,
. write = proc_loginuid_write ,
2010-03-18 01:06:02 +03:00
. llseek = generic_file_llseek ,
2005-04-17 02:20:36 +04:00
} ;
2008-03-13 15:15:31 +03:00
static ssize_t proc_sessionid_read ( struct file * file , char __user * buf ,
size_t count , loff_t * ppos )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2008-03-13 15:15:31 +03:00
struct task_struct * task = get_proc_task ( inode ) ;
ssize_t length ;
char tmpbuf [ TMPBUFLEN ] ;
if ( ! task )
return - ESRCH ;
length = scnprintf ( tmpbuf , TMPBUFLEN , "%u" ,
audit_get_sessionid ( task ) ) ;
put_task_struct ( task ) ;
return simple_read_from_buffer ( buf , count , ppos , tmpbuf , length ) ;
}
static const struct file_operations proc_sessionid_operations = {
. read = proc_sessionid_read ,
2010-03-18 01:06:02 +03:00
. llseek = generic_file_llseek ,
2008-03-13 15:15:31 +03:00
} ;
2005-04-17 02:20:36 +04:00
# endif
2006-12-08 13:39:47 +03:00
# ifdef CONFIG_FAULT_INJECTION
static ssize_t proc_fault_inject_read ( struct file * file , char __user * buf ,
size_t count , loff_t * ppos )
{
2013-01-24 02:07:38 +04:00
struct task_struct * task = get_proc_task ( file_inode ( file ) ) ;
2006-12-08 13:39:47 +03:00
char buffer [ PROC_NUMBUF ] ;
size_t len ;
int make_it_fail ;
if ( ! task )
return - ESRCH ;
make_it_fail = task->make_it_fail ;
put_task_struct ( task ) ;
len = snprintf ( buffer , sizeof ( buffer ) , "%i\n" , make_it_fail ) ;
2007-05-08 11:31:41 +04:00
return simple_read_from_buffer ( buf , count , ppos , buffer , len ) ;
2006-12-08 13:39:47 +03:00
}
static ssize_t proc_fault_inject_write ( struct file * file ,
const char __user * buf , size_t count , loff_t * ppos )
{
struct task_struct * task ;
2015-09-10 01:36:59 +03:00
char buffer [ PROC_NUMBUF ] ;
2006-12-08 13:39:47 +03:00
int make_it_fail ;
2015-09-10 01:36:59 +03:00
int rv ;
2006-12-08 13:39:47 +03:00
if ( ! capable ( CAP_SYS_RESOURCE ) )
return - EPERM ;
memset ( buffer , 0 , sizeof ( buffer ) ) ;
if ( count > sizeof ( buffer ) - 1 )
count = sizeof ( buffer ) - 1 ;
if ( copy_from_user ( buffer , buf , count ) )
return - EFAULT ;
2015-09-10 01:36:59 +03:00
rv = kstrtoint ( strstrip ( buffer ) , 0 , & make_it_fail ) ;
if ( rv < 0 )
return rv ;
2014-04-08 02:39:15 +04:00
if ( make_it_fail < 0 || make_it_fail > 1 )
return - EINVAL ;
2013-01-24 02:07:38 +04:00
task = get_proc_task ( file_inode ( file ) ) ;
2006-12-08 13:39:47 +03:00
if ( ! task )
return - ESRCH ;
task->make_it_fail = make_it_fail ;
put_task_struct ( task ) ;
2009-09-23 03:45:38 +04:00
return count ;
2006-12-08 13:39:47 +03:00
}
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_fault_inject_operations = {
2006-12-08 13:39:47 +03:00
. read = proc_fault_inject_read ,
. write = proc_fault_inject_write ,
2010-03-18 01:06:02 +03:00
. llseek = generic_file_llseek ,
2006-12-08 13:39:47 +03:00
} ;
fault-inject: support systematic fault injection
Add /proc/self/task/<current-tid>/fail-nth file that allows failing
0-th, 1-st, 2-nd and so on calls systematically.
Excerpt from the added documentation:
"Write to this file of integer N makes N-th call in the current task
fail (N is 0-based). Read from this file returns a single char 'Y' or
'N' that says if the fault setup with a previous write to this file
was injected or not, and disables the fault if it wasn't yet injected.
Note that this file enables all types of faults (slab, futex, etc).
This setting takes precedence over all other generic settings like
probability, interval, times, etc. But per-capability settings (e.g.
fail_futex/ignore-private) take precedence over it. This feature is
intended for systematic testing of faults in a single system call. See
an example below"
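The example referred to in the excerpt is not reproduced in this message. As a stand-in, here is a minimal userspace sketch of how a test harness might arm the per-thread file. The helper names fail_nth_path() and arm_fail_nth() are hypothetical, and the sketch assumes a kernel built with fault injection so that the fail-nth file exists; on other kernels the open simply fails.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: build the per-thread fail-nth path for a given tid. */
int fail_nth_path(char *out, size_t outlen, long tid)
{
    int n = snprintf(out, outlen, "/proc/self/task/%ld/fail-nth", tid);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Hypothetical helper: ask the kernel to fail the N-th fault-injectable
 * call made by the calling thread. Returns 0 on success, -1 on any error
 * (for example when the file is not present on this kernel).
 */
int arm_fail_nth(unsigned int nth)
{
    char path[64], val[16];
    long tid = syscall(SYS_gettid);    /* current thread id */
    ssize_t written;
    int fd;

    if (fail_nth_path(path, sizeof(path), tid) < 0)
        return -1;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    snprintf(val, sizeof(val), "%u", nth);
    written = write(fd, val, strlen(val));
    close(fd);

    return written == (ssize_t)strlen(val) ? 0 : -1;
}
```

A harness would typically call arm_fail_nth(n) immediately before the system call under test and repeat with increasing n until no fault is injected.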
Why add a new setting:
1. Existing settings are global rather than per-task.
So parallel testing is not possible.
2. attr->interval is close but it depends on attr->count
which is not reset to 0, so interval does not work as expected.
3. Trying to model this with existing settings requires manipulations
of all of probability, interval, times, space, task-filter and
unexposed count and per-task make-it-fail files.
4. Existing settings are per-failure-type, and the set of failure
types is potentially expanding.
5. make-it-fail can't be changed by unprivileged user and aggressive
stress testing better be done from an unprivileged user.
Similarly, this would require opening the debugfs files to the
unprivileged user, as he would need to reopen at least times file
(not possible to pre-open before dropping privs).
The proposed interface solves all of the above (see the example).
We want to integrate this into syzkaller fuzzer. A prototype has found
10 bugs in kernel in first day of usage:
https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance
I've made the current interface work with all types of our sandboxes.
For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
make /proc entries non-root owned. So I am fine with the current
version of the code.
[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-13 00:34:35 +03:00
static ssize_t proc_fail_nth_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * ppos )
{
struct task_struct * task ;
2017-07-15 00:49:52 +03:00
int err ;
unsigned int n ;
2017-07-13 00:34:35 +03:00
2017-07-15 00:49:57 +03:00
err = kstrtouint_from_user ( buf , count , 0 , & n ) ;
if ( err )
return err ;
2017-07-13 00:34:35 +03:00
task = get_proc_task ( file_inode ( file ) ) ;
if ( ! task )
return - ESRCH ;
2018-02-07 02:36:55 +03:00
task->fail_nth = n ;
2017-07-13 00:34:35 +03:00
put_task_struct ( task ) ;
2017-07-15 00:49:57 +03:00
2017-07-13 00:34:35 +03:00
return count ;
}
static ssize_t proc_fail_nth_read ( struct file * file , char __user * buf ,
size_t count , loff_t * ppos )
{
struct task_struct * task ;
2017-07-15 00:49:54 +03:00
char numbuf [ PROC_NUMBUF ] ;
ssize_t len ;
2017-07-13 00:34:35 +03:00
task = get_proc_task ( file_inode ( file ) ) ;
if ( ! task )
return - ESRCH ;
2018-02-07 02:36:55 +03:00
len = snprintf ( numbuf , sizeof ( numbuf ) , "%u\n" , task->fail_nth ) ;
2017-07-15 00:49:57 +03:00
put_task_struct ( task ) ;
2018-08-22 07:54:27 +03:00
return simple_read_from_buffer ( buf , count , ppos , numbuf , len ) ;
2017-07-13 00:34:35 +03:00
}
static const struct file_operations proc_fail_nth_operations = {
. read = proc_fail_nth_read ,
. write = proc_fail_nth_write ,
} ;
2006-12-08 13:39:47 +03:00
# endif
2008-01-25 23:08:34 +03:00
2007-07-09 20:52:00 +04:00
# ifdef CONFIG_SCHED_DEBUG
/*
* Print out various scheduling related per-task fields:
*/
static int sched_show ( struct seq_file * m , void * v )
{
struct inode * inode = m->private ;
2020-05-18 21:07:38 +03:00
struct pid_namespace * ns = proc_pid_ns ( inode->i_sb ) ;
2007-07-09 20:52:00 +04:00
struct task_struct * p ;
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
2017-08-06 07:41:41 +03:00
proc_sched_show_task ( p , ns , m ) ;
2007-07-09 20:52:00 +04:00
put_task_struct ( p ) ;
return 0 ;
}
static ssize_t
sched_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * offset )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2007-07-09 20:52:00 +04:00
struct task_struct * p ;
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
proc_sched_set_task ( p ) ;
put_task_struct ( p ) ;
return count ;
}
static int sched_open ( struct inode * inode , struct file * filp )
{
2011-01-13 04:00:34 +03:00
return single_open ( filp , sched_show , inode ) ;
2007-07-09 20:52:00 +04:00
}
static const struct file_operations proc_pid_sched_operations = {
. open = sched_open ,
. read = seq_read ,
. write = sched_write ,
. llseek = seq_lseek ,
2007-07-31 11:38:50 +04:00
. release = single_release ,
2007-07-09 20:52:00 +04:00
} ;
# endif
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting its nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
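The echo command above has a straightforward C equivalent. The sketch below is illustrative only: autogroup_path() and set_autogroup_nice() are hypothetical names, and the accepted nice range of -20..19 is an assumption based on the usual nice scale rather than something stated in this message.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: build the autogroup path for a given pid. */
int autogroup_path(char *out, size_t outlen, long pid)
{
    int n = snprintf(out, outlen, "/proc/%ld/autogroup", pid);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Hypothetical helper: write a nice level to /proc/<pid>/autogroup.
 * Returns 0 on success, -1 on a bad argument or any I/O error (for
 * example when the kernel was built without CONFIG_SCHED_AUTOGROUP).
 */
int set_autogroup_nice(long pid, int nice)
{
    char path[64], val[8];
    ssize_t written;
    int fd;

    /* Assumption: valid nice levels are -20..19. */
    if (nice < -20 || nice > 19)
        return -1;
    if (autogroup_path(path, sizeof(path), pid) < 0)
        return -1;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    snprintf(val, sizeof(val), "%d", nice);
    written = write(fd, val, strlen(val));
    close(fd);

    return written == (ssize_t)strlen(val) ? 0 : -1;
}
```

Because setting the nice level is rate limited for unprivileged users, a caller should be prepared for the write to fail and retry later instead of looping tightly.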
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 16:18:03 +03:00
# ifdef CONFIG_SCHED_AUTOGROUP
/*
* Print out autogroup related information:
*/
static int sched_autogroup_show ( struct seq_file * m , void * v )
{
struct inode * inode = m->private ;
struct task_struct * p ;
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
proc_sched_autogroup_show_task ( p , m ) ;
put_task_struct ( p ) ;
return 0 ;
}
static ssize_t
sched_autogroup_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * offset )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2010-11-30 16:18:03 +03:00
struct task_struct * p ;
char buffer [ PROC_NUMBUF ] ;
2011-05-27 03:25:50 +04:00
int nice ;
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
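The reference counting described above can be modelled in user space. This is only a sketch of the lifecycle (setsid creates a group and drops the old reference, children take a reference, the last exit destroys the group); `ag_model`, `proc_model`, `live_groups` and the helper names are hypothetical and are not the kernel's data structures.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted autogroup. */
struct ag_model {
    int refcount;
    int id;
};

/* Hypothetical stand-in for a process holding one autogroup reference. */
struct proc_model {
    struct ag_model *ag;
};

static int live_groups; /* groups created and not yet destroyed */

static struct ag_model *ag_create(int id)
{
    struct ag_model *ag = malloc(sizeof(*ag));

    if (!ag)
        return NULL;
    ag->refcount = 1; /* the creator owns the first reference */
    ag->id = id;
    live_groups++;
    return ag;
}

static struct ag_model *ag_get(struct ag_model *ag)
{
    ag->refcount++;
    return ag;
}

static void ag_put(struct ag_model *ag)
{
    /* Dropping the last reference destroys the group. */
    if (--ag->refcount == 0) {
        live_groups--;
        free(ag);
    }
}

/* setsid(): move into a fresh group, drop the reference to the old one. */
static int proc_setsid(struct proc_model *p, int new_id)
{
    struct ag_model *ag = ag_create(new_id);

    if (!ag)
        return -1;
    ag_put(p->ag);
    p->ag = ag;
    return 0;
}

/* fork(): the child inherits the group and takes its own reference. */
static void proc_fork(const struct proc_model *parent, struct proc_model *child)
{
    child->ag = ag_get(parent->ag);
}

/* exit of the last thread: the process's reference is dropped. */
static void proc_exit(struct proc_model *p)
{
    ag_put(p->ag);
    p->ag = NULL;
}
```

In this model a group outlives the session leader for as long as any child still references it, which matches the behaviour the paragraph describes.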
Autogroup bandwidth is controllable via setting its nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of a nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 16:18:03 +03:00
int err ;
memset ( buffer , 0 , sizeof ( buffer ) ) ;
if ( count > sizeof ( buffer ) - 1 )
count = sizeof ( buffer ) - 1 ;
if ( copy_from_user ( buffer , buf , count ) )
return - EFAULT ;
2011-05-27 03:25:50 +04:00
err = kstrtoint ( strstrip ( buffer ) , 0 , & nice ) ;
if ( err < 0 )
return err ;
2010-11-30 16:18:03 +03:00
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
2012-02-23 12:41:27 +04:00
err = proc_sched_autogroup_set_nice ( p , nice ) ;
2010-11-30 16:18:03 +03:00
if ( err )
count = err ;
put_task_struct ( p ) ;
return count ;
}
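The write handler above clamps the input to a small buffer, strips whitespace and parses an integer. A rough user-space analogue of that pattern is sketched below. `NUMBUF` and `parse_nice()` are hypothetical names; the value 13 is an assumption standing in for PROC_NUMBUF, and strtol() with a trailing-garbage check only approximates kstrtoint().

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define NUMBUF 13 /* assumption: stands in for PROC_NUMBUF */

/*
 * Parse at most NUMBUF - 1 bytes of src as an int.
 * Returns 0 on success or a negative errno value.
 */
static int parse_nice(const char *src, size_t count, int *out)
{
    char buffer[NUMBUF];
    char *start, *end;
    long val;

    /* Zero the buffer so the copied text is always NUL-terminated. */
    memset(buffer, 0, sizeof(buffer));
    if (count > sizeof(buffer) - 1)
        count = sizeof(buffer) - 1;
    memcpy(buffer, src, count);

    /* Approximate strstrip(): skip leading and trim trailing blanks. */
    start = buffer;
    while (isspace((unsigned char)*start))
        start++;
    end = start + strlen(start);
    while (end > start && isspace((unsigned char)end[-1]))
        *--end = '\0';
    if (*start == '\0')
        return -EINVAL;

    /* Base 0 accepts decimal, octal (0...) and hex (0x...) input. */
    errno = 0;
    val = strtol(start, &end, 0);
    if (*end != '\0')
        return -EINVAL; /* no digits, or trailing garbage */
    if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
        return -ERANGE;

    *out = (int)val;
    return 0;
}
```

Over-long input is silently truncated before parsing, as in the handler above, so a 20-digit string is judged on its first 12 characters only.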
static int sched_autogroup_open ( struct inode * inode , struct file * filp )
{
int ret ;
ret = single_open ( filp , sched_autogroup_show , NULL ) ;
if ( ! ret ) {
struct seq_file * m = filp - > private_data ;
m - > private = inode ;
}
return ret ;
}
static const struct file_operations proc_pid_sched_autogroup_operations = {
. open = sched_autogroup_open ,
. read = seq_read ,
. write = sched_autogroup_write ,
. llseek = seq_lseek ,
. release = single_release ,
} ;
# endif /* CONFIG_SCHED_AUTOGROUP */
2019-11-12 04:27:16 +03:00
# ifdef CONFIG_TIME_NS
static int timens_offsets_show ( struct seq_file * m , void * v )
{
struct task_struct * p ;
p = get_proc_task ( file_inode ( m - > file ) ) ;
if ( ! p )
return - ESRCH ;
proc_timens_show_offsets ( p , m ) ;
put_task_struct ( p ) ;
return 0 ;
}
static ssize_t timens_offsets_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * ppos )
{
struct inode * inode = file_inode ( file ) ;
struct proc_timens_offset offsets [ 2 ] ;
char * kbuf = NULL , * pos , * next_line ;
struct task_struct * p ;
int ret , noffsets ;
/* Only allow < page size writes at the beginning of the file */
if ( ( * ppos ! = 0 ) | | ( count > = PAGE_SIZE ) )
return - EINVAL ;
/* Slurp in the user data */
kbuf = memdup_user_nul ( buf , count ) ;
if ( IS_ERR ( kbuf ) )
return PTR_ERR ( kbuf ) ;
/* Parse the user data */
ret = - EINVAL ;
noffsets = 0 ;
for ( pos = kbuf ; pos ; pos = next_line ) {
struct proc_timens_offset * off = & offsets [ noffsets ] ;
2020-04-11 18:40:31 +03:00
char clock [ 10 ] ;
2019-11-12 04:27:16 +03:00
int err ;
/* Find the end of line and ensure we don't look past it */
next_line = strchr ( pos , ' \n ' ) ;
if ( next_line ) {
* next_line = ' \0 ' ;
next_line + + ;
if ( * next_line = = ' \0 ' )
next_line = NULL ;
}
2020-04-11 18:40:31 +03:00
err = sscanf ( pos , " %9s %lld %lu " , clock ,
2019-11-12 04:27:16 +03:00
& off - > val . tv_sec , & off - > val . tv_nsec ) ;
if ( err ! = 3 | | off - > val . tv_nsec > = NSEC_PER_SEC )
goto out ;
2020-04-11 18:40:31 +03:00
clock [ sizeof ( clock ) - 1 ] = 0 ;
if ( strcmp ( clock , " monotonic " ) = = 0 | |
strcmp ( clock , __stringify ( CLOCK_MONOTONIC ) ) = = 0 )
off - > clockid = CLOCK_MONOTONIC ;
else if ( strcmp ( clock , " boottime " ) = = 0 | |
strcmp ( clock , __stringify ( CLOCK_BOOTTIME ) ) = = 0 )
off - > clockid = CLOCK_BOOTTIME ;
else
goto out ;
2019-11-12 04:27:16 +03:00
noffsets + + ;
if ( noffsets = = ARRAY_SIZE ( offsets ) ) {
if ( next_line )
count = next_line - kbuf ;
break ;
}
}
ret = - ESRCH ;
p = get_proc_task ( inode ) ;
if ( ! p )
goto out ;
ret = proc_timens_set_offset ( file , p , offsets , noffsets ) ;
put_task_struct ( p ) ;
if ( ret )
goto out ;
ret = count ;
out :
kfree ( kbuf ) ;
return ret ;
}
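The line-by-line parsing in timens_offsets_write() can be tried out in user space. The following is a simplified sketch, not the kernel code: `struct offset`, `enum clk`, `MAX_OFFSETS` and `parse_offsets()` are hypothetical names, only the symbolic clock names are accepted (the numeric clock-id spellings are left out), and the function reports how many entries it parsed rather than how many bytes it consumed.

```c
#include <stdio.h>
#include <string.h>

#define NSEC_PER_SEC 1000000000UL
#define MAX_OFFSETS  2

/* Hypothetical stand-ins for the two supported clock ids. */
enum clk { CLK_MONOTONIC, CLK_BOOTTIME };

struct offset {
    enum clk clockid;
    long long sec;
    unsigned long nsec;
};

/*
 * Parse up to MAX_OFFSETS lines of the form "<clock> <sec> <nsec>".
 * The text is modified in place (newlines become NULs).
 * Returns the number of entries parsed, or -1 on malformed input.
 */
static int parse_offsets(char *text, struct offset *offs)
{
    char *pos, *next_line;
    int n = 0;

    for (pos = text; pos; pos = next_line) {
        struct offset *off = &offs[n];
        char clock[10];

        /* Find the end of the line so sscanf() cannot read past it. */
        next_line = strchr(pos, '\n');
        if (next_line) {
            *next_line++ = '\0';
            if (*next_line == '\0')
                next_line = NULL;
        }

        /* %9s bounds the clock name to the 10-byte buffer. */
        if (sscanf(pos, "%9s %lld %lu", clock, &off->sec, &off->nsec) != 3 ||
            off->nsec >= NSEC_PER_SEC)
            return -1;

        if (strcmp(clock, "monotonic") == 0)
            off->clockid = CLK_MONOTONIC;
        else if (strcmp(clock, "boottime") == 0)
            off->clockid = CLK_BOOTTIME;
        else
            return -1;

        /* Stop once the fixed-size array is full. */
        if (++n == MAX_OFFSETS)
            break;
    }
    return n;
}
```

The nanosecond bound check rejects values of one second or more, so an offset has to be written as a whole number of seconds plus a sub-second remainder.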
static int timens_offsets_open ( struct inode * inode , struct file * filp )
{
return single_open ( filp , timens_offsets_show , inode ) ;
}
static const struct file_operations proc_timens_offsets_operations = {
. open = timens_offsets_open ,
. read = seq_read ,
. write = timens_offsets_write ,
. llseek = seq_lseek ,
. release = single_release ,
} ;
# endif /* CONFIG_TIME_NS */
2009-12-15 05:00:05 +03:00
static ssize_t comm_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * offset )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2009-12-15 05:00:05 +03:00
struct task_struct * p ;
char buffer [ TASK_COMM_LEN ] ;
2013-05-01 02:28:18 +04:00
const size_t maxlen = sizeof ( buffer ) - 1 ;
2009-12-15 05:00:05 +03:00
memset ( buffer , 0 , sizeof ( buffer ) ) ;
2013-05-01 02:28:18 +04:00
if ( copy_from_user ( buffer , buf , count > maxlen ? maxlen : count ) )
2009-12-15 05:00:05 +03:00
return - EFAULT ;
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
2021-09-08 05:57:35 +03:00
if ( same_thread_group ( current , p ) ) {
2009-12-15 05:00:05 +03:00
set_task_comm ( p , buffer ) ;
2021-09-08 05:57:35 +03:00
proc_comm_connector ( p ) ;
}
2009-12-15 05:00:05 +03:00
else
count = - EINVAL ;
put_task_struct ( p ) ;
return count ;
}
static int comm_show ( struct seq_file * m , void * v )
{
struct inode * inode = m - > private ;
struct task_struct * p ;
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
2018-05-18 18:47:13 +03:00
proc_task_name ( m , p , false ) ;
seq_putc ( m , ' \n ' ) ;
2009-12-15 05:00:05 +03:00
put_task_struct ( p ) ;
return 0 ;
}
static int comm_open ( struct inode * inode , struct file * filp )
{
2011-01-13 04:00:34 +03:00
return single_open ( filp , comm_show , inode ) ;
2009-12-15 05:00:05 +03:00
}
static const struct file_operations proc_pid_set_comm_operations = {
. open = comm_open ,
. read = seq_read ,
. write = comm_write ,
. llseek = seq_lseek ,
. release = single_release ,
} ;
2012-01-11 03:11:20 +04:00
static int proc_exe_link ( struct dentry * dentry , struct path * exe_path )
2008-04-29 12:01:36 +04:00
{
struct task_struct * task ;
struct file * exe_file ;
2015-03-18 01:25:59 +03:00
task = get_proc_task ( d_inode ( dentry ) ) ;
2008-04-29 12:01:36 +04:00
if ( ! task )
return - ENOENT ;
2016-08-23 17:20:38 +03:00
exe_file = get_task_exe_file ( task ) ;
2008-04-29 12:01:36 +04:00
put_task_struct ( task ) ;
if ( exe_file ) {
* exe_path = exe_file - > f_path ;
path_get ( & exe_file - > f_path ) ;
fput ( exe_file ) ;
return 0 ;
} else
return - ENOENT ;
}
2015-11-17 18:20:54 +03:00
static const char * proc_pid_get_link ( struct dentry * dentry ,
2015-12-29 23:58:39 +03:00
struct inode * inode ,
struct delayed_call * done )
2005-04-17 02:20:36 +04:00
{
2012-06-18 18:47:03 +04:00
struct path path ;
2005-04-17 02:20:36 +04:00
int error = - EACCES ;
2015-11-17 18:20:54 +03:00
if ( ! dentry )
return ERR_PTR ( - ECHILD ) ;
2006-06-26 11:25:58 +04:00
/* Are we allowed to snoop on the tasks file descriptors? */
if ( ! proc_fd_access_allowed ( inode ) )
2005-04-17 02:20:36 +04:00
goto out ;
2012-06-18 18:47:03 +04:00
error = PROC_I ( inode ) - > op . proc_get_link ( dentry , & path ) ;
if ( error )
goto out ;
2019-12-06 17:13:28 +03:00
error = nd_jump_link ( & path ) ;
2005-04-17 02:20:36 +04:00
out :
[PATCH] Fix up symlink function pointers
This fixes up the symlink functions for the calling convention change:
* afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs,
smbfs, sysvfs, ufs, xfs - prototype change for ->follow_link()
* befs, smbfs, xfs - same for ->put_link()
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-20 03:17:39 +04:00
return ERR_PTR ( error ) ;
2005-04-17 02:20:36 +04:00
}
2022-08-04 20:11:42 +03:00
static int do_proc_readlink ( const struct path * path , char __user * buffer , int buflen )
2005-04-17 02:20:36 +04:00
{
2022-03-24 02:05:20 +03:00
char * tmp = kmalloc ( PATH_MAX , GFP_KERNEL ) ;
2008-02-15 06:38:35 +03:00
char * pathname ;
2005-04-17 02:20:36 +04:00
int len ;
if ( ! tmp )
return - ENOMEM ;
2007-05-08 11:31:41 +04:00
2022-03-24 02:05:20 +03:00
pathname = d_path ( path , tmp , PATH_MAX ) ;
2008-02-15 06:38:35 +03:00
len = PTR_ERR ( pathname ) ;
if ( IS_ERR ( pathname ) )
2005-04-17 02:20:36 +04:00
goto out ;
2022-03-24 02:05:20 +03:00
len = tmp + PATH_MAX - 1 - pathname ;
2005-04-17 02:20:36 +04:00
if ( len > buflen )
len = buflen ;
2008-02-15 06:38:35 +03:00
if ( copy_to_user ( buffer , pathname , len ) )
2005-04-17 02:20:36 +04:00
len = - EFAULT ;
out :
2022-03-24 02:05:20 +03:00
kfree ( tmp ) ;
2005-04-17 02:20:36 +04:00
return len ;
}
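The length computation in do_proc_readlink() relies on the path string ending at the last byte of the temporary buffer, so the length is the distance from the returned pointer to the terminating NUL. The sketch below reproduces only that arithmetic in user space; `place_at_tail()` and `tail_len()` are hypothetical helpers, and the right-aligned layout is an assumption inferred from the expression `tmp + PATH_MAX - 1 - pathname`.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy s so that its terminating NUL lands on the last byte of
 * buf[0..size). Returns the start of the string inside buf, or NULL if
 * it does not fit.
 */
static char *place_at_tail(char *buf, size_t size, const char *s)
{
    size_t n = strlen(s);
    char *start;

    if (n + 1 > size)
        return NULL;
    start = buf + size - 1 - n;
    memcpy(start, s, n + 1);
    return start;
}

/*
 * Length of the right-aligned string (excluding the NUL), clamped to
 * the caller's buffer length as readlink() semantics require.
 */
static int tail_len(const char *buf, size_t size, const char *start, int buflen)
{
    int len = (int)(buf + size - 1 - start);

    return len > buflen ? buflen : len;
}
```

As with readlink(), the clamped result is not NUL-terminated when the destination is shorter than the path.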
static int proc_pid_readlink ( struct dentry * dentry , char __user * buffer , int buflen )
{
int error = - EACCES ;
2015-03-18 01:25:59 +03:00
struct inode * inode = d_inode ( dentry ) ;
2008-02-15 06:38:35 +03:00
struct path path ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:25:58 +04:00
/* Are we allowed to snoop on the tasks file descriptors? */
if ( ! proc_fd_access_allowed ( inode ) )
2005-04-17 02:20:36 +04:00
goto out ;
2012-01-11 03:11:20 +04:00
error = PROC_I ( inode ) - > op . proc_get_link ( dentry , & path ) ;
2005-04-17 02:20:36 +04:00
if ( error )
goto out ;
2008-02-15 06:38:35 +03:00
error = do_proc_readlink ( & path , buffer , buflen ) ;
path_put ( & path ) ;
2005-04-17 02:20:36 +04:00
out :
return error ;
}
2012-08-23 14:43:24 +04:00
const struct inode_operations proc_pid_link_inode_operations = {
2005-04-17 02:20:36 +04:00
. readlink = proc_pid_readlink ,
2015-11-17 18:20:54 +03:00
. get_link = proc_pid_get_link ,
2006-07-15 23:26:45 +04:00
. setattr = proc_setattr ,
2005-04-17 02:20:36 +04:00
} ;
2006-10-02 13:17:05 +04:00
/* building an inode */
2017-09-30 21:45:42 +03:00
void task_dump_owner ( struct task_struct * task , umode_t mode ,
2017-01-03 00:23:11 +03:00
kuid_t * ruid , kgid_t * rgid )
{
/* Depending on the state of dumpable compute who should own a
* proc file for a task .
*/
const struct cred * cred ;
kuid_t uid ;
kgid_t gid ;
2018-04-21 00:56:03 +03:00
if ( unlikely ( task - > flags & PF_KTHREAD ) ) {
* ruid = GLOBAL_ROOT_UID ;
* rgid = GLOBAL_ROOT_GID ;
return ;
}
2017-01-03 00:23:11 +03:00
/* Default to the tasks effective ownership */
rcu_read_lock ( ) ;
cred = __task_cred ( task ) ;
uid = cred - > euid ;
gid = cred - > egid ;
rcu_read_unlock ( ) ;
/*
* Before the / proc / pid / status file was created the only way to read
* the effective uid of a / process was to stat / proc / pid . Reading
* / proc / pid / status is slow enough that procps and other packages
* kept stating / proc / pid . To keep the rules in / proc simple I have
* made this apply to all per process world readable and executable
* directories .
*/
if ( mode ! = ( S_IFDIR | S_IRUGO | S_IXUGO ) ) {
struct mm_struct * mm ;
task_lock ( task ) ;
mm = task - > mm ;
/* Make non-dumpable tasks owned by some root */
if ( mm ) {
if ( get_dumpable ( mm ) ! = SUID_DUMP_USER ) {
struct user_namespace * user_ns = mm - > user_ns ;
uid = make_kuid ( user_ns , 0 ) ;
if ( ! uid_valid ( uid ) )
uid = GLOBAL_ROOT_UID ;
gid = make_kgid ( user_ns , 0 ) ;
if ( ! gid_valid ( gid ) )
gid = GLOBAL_ROOT_GID ;
}
} else {
uid = GLOBAL_ROOT_UID ;
gid = GLOBAL_ROOT_GID ;
}
task_unlock ( task ) ;
}
* ruid = uid ;
* rgid = gid ;
}
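The ownership rules in task_dump_owner() reduce to a small decision table. The model below is a user-space sketch for experimenting with those rules, not kernel code: `struct task_model`, `dump_owner_model()` and `INVALID_ID` are hypothetical, plain unsigned ids stand in for kuid_t/kgid_t, and the result of mapping id 0 through the mm's user namespace is supplied as an input.

```c
#include <stdbool.h>

#define ROOT_ID    0u
#define INVALID_ID ((unsigned)-1) /* hypothetical marker for an unmappable id */

struct task_model {
    bool kthread;         /* PF_KTHREAD is set */
    bool has_mm;          /* task->mm is not NULL */
    bool dumpable_user;   /* get_dumpable(mm) == SUID_DUMP_USER */
    unsigned euid, egid;  /* effective credentials */
    unsigned ns_root_uid; /* id 0 of the mm's user namespace, or INVALID_ID */
    unsigned ns_root_gid;
};

/*
 * world_rx_dir is true when the mode is exactly a world readable and
 * executable directory (S_IFDIR | S_IRUGO | S_IXUGO in the original).
 */
static void dump_owner_model(const struct task_model *t, bool world_rx_dir,
                             unsigned *ruid, unsigned *rgid)
{
    /* Default to the task's effective ownership. */
    unsigned uid = t->euid;
    unsigned gid = t->egid;

    /* Kernel threads are always owned by global root. */
    if (t->kthread) {
        *ruid = ROOT_ID;
        *rgid = ROOT_ID;
        return;
    }

    if (!world_rx_dir) {
        if (!t->has_mm) {
            uid = ROOT_ID;
            gid = ROOT_ID;
        } else if (!t->dumpable_user) {
            /* Non-dumpable: owned by the namespace's root if mappable. */
            uid = t->ns_root_uid != INVALID_ID ? t->ns_root_uid : ROOT_ID;
            gid = t->ns_root_gid != INVALID_ID ? t->ns_root_gid : ROOT_ID;
        }
    }

    *ruid = uid;
    *rgid = gid;
}
```

The uid and gid fall back to global root independently, so a namespace that maps only one of them yields a mixed result.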
2020-02-20 03:22:26 +03:00
void proc_pid_evict_inode ( struct proc_inode * ei )
{
struct pid * pid = ei - > pid ;
if ( S_ISDIR ( ei - > vfs_inode . i_mode ) ) {
2020-04-07 17:43:04 +03:00
spin_lock ( & pid - > lock ) ;
2020-02-20 03:22:26 +03:00
hlist_del_init_rcu ( & ei - > sibling_inodes ) ;
2020-04-07 17:43:04 +03:00
spin_unlock ( & pid - > lock ) ;
2020-02-20 03:22:26 +03:00
}
put_pid ( pid ) ;
}
2022-07-13 16:00:29 +03:00
struct inode * proc_pid_make_inode ( struct super_block * sb ,
2016-11-11 00:18:28 +03:00
struct task_struct * task , umode_t mode )
2006-10-02 13:17:05 +04:00
{
struct inode * inode ;
struct proc_inode * ei ;
2020-02-20 03:22:26 +03:00
struct pid * pid ;
2005-04-17 02:20:36 +04:00
2006-10-02 13:17:05 +04:00
/* We need a new inode */
2005-04-17 02:20:36 +04:00
2006-10-02 13:17:05 +04:00
inode = new_inode ( sb ) ;
if ( ! inode )
goto out ;
/* Common stuff */
ei = PROC_I ( inode ) ;
2016-11-11 00:18:28 +03:00
inode - > i_mode = mode ;
2010-10-23 19:19:54 +04:00
inode - > i_ino = get_next_ino ( ) ;
2016-09-14 17:48:04 +03:00
inode - > i_mtime = inode - > i_atime = inode - > i_ctime = current_time ( inode ) ;
2006-10-02 13:17:05 +04:00
inode - > i_op = & proc_def_inode_operations ;
/*
* grab the reference to task .
*/
2020-02-20 03:22:26 +03:00
pid = get_task_pid ( task , PIDTYPE_PID ) ;
if ( ! pid )
2006-10-02 13:17:05 +04:00
goto out_unlock ;
2020-02-20 03:22:26 +03:00
/* Let the pid remember us for quick removal */
ei - > pid = pid ;
2017-01-03 00:23:11 +03:00
task_dump_owner ( task , 0 , & inode - > i_uid , & inode - > i_gid ) ;
2006-10-02 13:17:05 +04:00
security_task_to_inode ( task , inode ) ;
2005-04-17 02:20:36 +04:00
out :
2006-10-02 13:17:05 +04:00
return inode ;
out_unlock :
iput ( inode ) ;
return NULL ;
2005-04-17 02:20:36 +04:00
}
2022-07-13 16:00:29 +03:00
/*
* Generating an inode and adding it into @ pid - > inodes , so that task will
* invalidate inode ' s dentry before being released .
*
* This helper is used for creating dir - type entries under ' / proc ' and
* ' / proc / < tgid > / task ' . Other entries ( eg . fd , stat ) under ' / proc / < tgid > '
* can be released by invalidating ' / proc / < tgid > ' dentry .
* In theory , dentries under ' / proc / < tgid > / task ' can also be released by
* invalidating ' / proc / < tgid > ' dentry , we reserve it to handle single
* thread exiting situation : Any one of threads should invalidate its
* ' / proc / < tgid > / task / < pid > ' dentry before released .
*/
static struct inode * proc_pid_make_base_inode ( struct super_block * sb ,
struct task_struct * task , umode_t mode )
{
struct inode * inode ;
struct proc_inode * ei ;
struct pid * pid ;
inode = proc_pid_make_inode ( sb , task , mode ) ;
if ( ! inode )
return NULL ;
/* Let proc_flush_pid find this directory inode */
ei = PROC_I ( inode ) ;
pid = ei - > pid ;
spin_lock ( & pid - > lock ) ;
hlist_add_head_rcu ( & ei - > sibling_inodes , & pid - > inodes ) ;
spin_unlock ( & pid - > lock ) ;
return inode ;
}
2023-01-13 14:49:12 +03:00
int pid_getattr ( struct mnt_idmap * idmap , const struct path * path ,
2021-01-21 16:19:43 +03:00
struct kstat * stat , u32 request_mask , unsigned int query_flags )
2005-04-17 02:20:36 +04:00
{
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 19:46:22 +03:00
struct inode * inode = d_inode ( path - > dentry ) ;
proc: allow to mount many instances of proc in one pid namespace
This patch allows multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.
1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespaces we have to manage all the
processes inside too, whereas our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.
2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, so any remount or modification of 'hidepid' will
propagate to all other procfs mounts.
This also makes it hard to support the Yama LSM in desktop and user
sessions. Yama ptrace scope, which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() check on
/proc/<pids>/ to always run inside that new procfs instance. This will
allow user sessions to specify whether procfs should be populated only
with pids that the user can ptrace.
By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc; they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others is already using Yama to restrict which processes can be
ptraced. This change makes it possible to restrict /proc/<pids>/, but
more importantly it gives desktop users a generic and usable way to
specify which users should see all processes and which users cannot.
Side notes:
* This covers a gap in seccomp, which is not able to parse
arguments. It is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...
In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.
Selftest has been added to verify new behavior.
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-19 17:10:52 +03:00
	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
2006-10-02 13:17:05 +04:00
	struct task_struct *task;
2008-11-14 02:39:19 +03:00
2023-01-13 14:49:12 +03:00
	generic_fillattr(&nop_mnt_idmap, inode, stat);
2005-04-17 02:20:36 +04:00
2012-02-09 20:48:21 +04:00
	stat->uid = GLOBAL_ROOT_UID;
	stat->gid = GLOBAL_ROOT_GID;
2018-06-08 03:10:07 +03:00
	rcu_read_lock();
2006-10-02 13:17:05 +04:00
	task = pid_task(proc_pid(inode), PIDTYPE_PID);
	if (task) {
2020-04-19 17:10:52 +03:00
		if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.
The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:
hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.
hidepid=1 means users may not access any /proc/<pid>/ directories, but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking is done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.
hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides process' euid
and egid. It complicates an intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.
gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting
nonroot user in sudoers file or something. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
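The interaction between hidepid= and gid= described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel's implementation: the types proc_mount_opts and viewer, the enum constants other than HIDEPID_INVISIBLE, and the function model_has_pid_permissions() are hypothetical, and the boolean may_ptrace_target stands in for whatever ownership or ptrace check the kernel actually performs.

```c
#include <stdbool.h>

/* Hypothetical model of the hidepid= levels described in the text. */
enum hidepid_level {
	HIDEPID_OFF = 0,	/* hidepid=0: old behaviour, nothing hidden */
	HIDEPID_NO_ACCESS = 1,	/* hidepid=1: other users' dirs not accessible */
	HIDEPID_INVISIBLE = 2,	/* hidepid=2: other users' dirs not even listed */
};

/* Hypothetical: the options a procfs mount was given. */
struct proc_mount_opts {
	enum hidepid_level hide_pid;
};

/* Hypothetical: facts about the process looking at /proc/<pid>/. */
struct viewer {
	bool in_gid_group;	/* member of the group named by gid=XXX */
	bool may_ptrace_target;	/* e.g. the viewer owns the target task */
};

/*
 * Decide whether 'who' passes a check that only applies once the mount's
 * hidepid level has reached 'hide_pid_min'. Access checks would pass
 * HIDEPID_NO_ACCESS, visibility checks (readdir, getattr) would pass
 * HIDEPID_INVISIBLE.
 */
static bool model_has_pid_permissions(const struct proc_mount_opts *opts,
				      const struct viewer *who,
				      enum hidepid_level hide_pid_min)
{
	if (opts->hide_pid < hide_pid_min)
		return true;	/* restriction not enabled at this level */
	if (who->in_gid_group)
		return true;	/* gid= group is exempt, as in hidepid=0 */
	return who->may_ptrace_target;
}
```

In this model a mount with hidepid=1 still lists other users' processes, because a visibility check asks for HIDEPID_INVISIBLE, while an access check asking for HIDEPID_NO_ACCESS is refused unless the viewer is in the gid= group or is allowed to inspect the target.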
hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystroke
timings:
http://www.openwall.com/lists/oss-security/2011/11/05/3
hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on the fact that the leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user who started the setuid program may read
the counters.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:31 +04:00
			rcu_read_unlock();
			/*
			 * This doesn't prevent learning whether PID exists,
			 * it only makes getattr() consistent with readdir().
			 */
			return -ENOENT;
		}
2017-01-03 00:23:11 +03:00
		task_dump_owner(task, inode->i_mode, &stat->uid, &stat->gid);
2005-04-17 02:20:36 +04:00
	}
2006-10-02 13:17:05 +04:00
	rcu_read_unlock();
2005-06-23 11:09:43 +04:00
	return 0;
2005-04-17 02:20:36 +04:00
}
/* dentry stuff */
/*
2018-05-03 04:26:16 +03:00
 * Set <pid>/... inode ownership (can change due to setuid(), etc.)
 */
void pid_update_inode(struct task_struct *task, struct inode *inode)
{
	task_dump_owner(task, inode->i_mode, &inode->i_uid, &inode->i_gid);
	inode->i_mode &= ~(S_ISUID | S_ISGID);
	security_task_to_inode(task, inode);
}
/*
2005-04-17 02:20:36 +04:00
 * Rewrite the inode's ownerships here because the owning task may have
 * performed a setuid(), etc.
2006-06-26 11:25:55 +04:00
 *
2005-04-17 02:20:36 +04:00
 */
2018-05-03 04:26:16 +03:00
static int pid_revalidate(struct dentry *dentry, unsigned int flags)
2005-04-17 02:20:36 +04:00
{
2011-01-07 09:49:57 +03:00
	struct inode *inode;
	struct task_struct *task;
proc: allow pid_revalidate() during LOOKUP_RCU
Problem Description:
When running ~128 parallel instances of
TZ=/etc/localtime ps -fe >/dev/null
on a 128-CPU machine, the %sys utilization reaches 97%, and perf shows
the following code path as being responsible for heavy contention on the
d_lockref spinlock:
walk_component()
lookup_fast()
d_revalidate()
pid_revalidate() // returns -ECHILD
unlazy_child()
lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention
The reason is that pid_revalidate() is triggering a drop from RCU to ref
path walk mode. All concurrent path lookups thus try to grab a
reference to the dentry for /proc/, before re-executing pid_revalidate()
and then stepping into the /proc/$pid directory. Thus there is huge
spinlock contention.
This patch allows pid_revalidate() to execute in RCU mode, meaning that
the path lookup can successfully enter the /proc/$pid directory while
still in RCU mode. Later on, the path lookup may still drop into ref
mode, but the contention will be much reduced at this point.
By applying this patch, %sys utilization falls to around 85% under the
same workload, and the number of ps processes executed per unit time
increases by 3x-4x. Although this particular workload is a bit
contrived, we have seen some large collections of eager monitoring
scripts which produced similarly high %sys time due to contention in the
/proc directory.
As a result of this patch, Al noted that several procfs methods which were
only called in ref-walk mode could now be called from RCU mode. To
ensure that this patch is safe, I audited all the inode get_link and
permission() implementations, as well as dentry d_revalidate()
implementations, in fs/proc. The purpose here is to ensure that they
either are safe to call in RCU (i.e. don't sleep) or correctly bail out
of RCU mode if they don't support it. My analysis shows that all
at-risk procfs methods are safe to call under RCU, and thus this patch
is safe.
Procfs RCU-walk Analysis:
This analysis is up-to-date with 5.15-rc3. When called under RCU mode,
these functions have arguments as follows:
* get_link() receives a NULL dentry pointer when called in RCU mode.
* permission() receives MAY_NOT_BLOCK in the mode parameter when called
from RCU.
* d_revalidate() receives LOOKUP_RCU in flags.
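The difference between a method that "bails out" and one that is "RCU safe" can be sketched with a small userspace model. Everything here is an assumption made for illustration: MODEL_LOOKUP_RCU is a placeholder bit rather than the kernel's LOOKUP_RCU value, and both functions are hypothetical stand-ins for a d_revalidate() implementation. The only facts taken from the text are that the RCU case is signalled through the flags argument and that bailing out means returning -ECHILD.

```c
#include <errno.h>

/* Placeholder flag bit for this model only; not the kernel's LOOKUP_RCU. */
#define MODEL_LOOKUP_RCU 0x1u

/*
 * Hypothetical "bails out" style: refuse to run in RCU-walk mode and
 * return -ECHILD, which makes the caller drop to ref-walk mode and
 * call the method again.
 */
static int model_revalidate_bails_out(unsigned int flags, int task_alive)
{
	if (flags & MODEL_LOOKUP_RCU)
		return -ECHILD;
	return task_alive ? 1 : 0;	/* 1 = dentry still valid, 0 = stale */
}

/*
 * Hypothetical "RCU safe" style: the body never sleeps, so it gives the
 * same answer in both modes and never forces a drop out of RCU-walk.
 */
static int model_revalidate_rcu_safe(unsigned int flags, int task_alive)
{
	(void)flags;			/* mode does not matter here */
	return task_alive ? 1 : 0;
}
```

In this model the first style costs every RCU-mode caller a retry in ref-walk mode, which is the source of the contention described in the problem statement, while the second style lets the lookup continue without leaving RCU mode.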
For the following functions, either they are trivially RCU safe, or they
explicitly bail at the beginning of the function when they run:
proc_ns_get_link (bails out)
proc_get_link (RCU safe)
proc_pid_get_link (bails out)
map_files_d_revalidate (bails out)
map_misc_d_revalidate (bails out)
proc_net_d_revalidate (RCU safe)
proc_sys_revalidate (bails out, also not under /proc/$pid)
tid_fd_revalidate (bails out)
proc_sys_permission (not under /proc/$pid)
The remainder of the functions require a bit more detail:
* proc_fd_permission: RCU safe. All of the body of this function is
under rcu_read_lock(), except generic_permission() which declares
itself RCU safe in its documentation string.
* proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware
and otherwise looks safe. The same is true of proc_thread_self_get_link.
* proc_map_files_get_link: calls ns_capable, which calls capable(), and
thus calls into the audit code (see note #1 below). The remainder is
just a call to the trivially safe proc_pid_get_link().
* proc_pid_permission: calls ptrace_may_access(), which appears RCU
safe, although it does call into the "security_ptrace_access_check()"
hook, which looks safe under smack and selinux. Just the audit code is
of concern. Also uses get_task_struct() and put_task_struct(), see
note #2 below.
* proc_tid_comm_permission: Appears safe, though calls put_task_struct
(see note #2 below).
Note #1:
Most of the concern of RCU safety has centered around the audit code.
However, since b17ec22fb339 ("selinux: slow_avc_audit has become
non-blocking"), it's safe to call this code under RCU. So all of the
above are safe by my estimation.
Note #2: get_task_struct() and put_task_struct():
The majority of get_task_struct() is under RCU read lock, and in any
case it is a simple increment. But put_task_struct() is complex, given
that it could at some point free the task struct, and this process has
many steps which I couldn't manually verify. However, several other
places call put_task_struct() under RCU, so it appears safe to use
here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296)
Patch description:
pid_revalidate() drops from RCU into REF lookup mode. When many threads
are resolving paths within /proc in parallel, this can result in heavy
spinlock contention on d_lockref as each thread tries to grab a
reference to the /proc dentry (and drop it shortly thereafter).
Investigation indicates that it is not necessary to drop RCU in
pid_revalidate(), as no RCU data is modified and the function never
sleeps. So, remove the LOOKUP_RCU check.
Link: https://lkml.kernel.org/r/20211004175629.292270-2-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 05:32:05 +03:00
	int ret = 0;
2008-11-14 02:39:19 +03:00
2021-11-09 05:32:05 +03:00
	rcu_read_lock();
	inode = d_inode_rcu(dentry);
	if (!inode)
		goto out;
	task = pid_task(proc_pid(inode), PIDTYPE_PID);
2011-01-07 09:49:57 +03:00
2006-06-26 11:25:55 +04:00
	if (task) {
2018-05-03 04:26:16 +03:00
		pid_update_inode(task, inode);
2021-11-09 05:32:05 +03:00
		ret = 1;
2005-04-17 02:20:36 +04:00
	}
2021-11-09 05:32:05 +03:00
out :
rcu_read_unlock ( ) ;
return ret ;
2005-04-17 02:20:36 +04:00
}
2014-01-24 03:55:39 +04:00
static inline bool proc_inode_is_dead(struct inode *inode)
{
	return !proc_pid(inode)->tasks[PIDTYPE_PID].first;
}
2013-04-12 04:08:50 +04:00
int pid_delete_dentry(const struct dentry *dentry)
{
	/* Is the task we represent dead?
	 * If so, then don't put the dentry on the lru list,
	 * kill it immediately.
	 */
2015-03-18 01:25:59 +03:00
	return proc_inode_is_dead(d_inode(dentry));
2013-04-12 04:08:50 +04:00
}
2010-03-08 03:41:34 +03:00
const struct dentry_operations pid_dentry_operations =
2006-10-02 13:17:05 +04:00
{
	.d_revalidate	= pid_revalidate,
	.d_delete	= pid_delete_dentry,
};
/* Lookups */
2006-10-02 13:18:57 +04:00
/*
 * Fill a directory entry.
 *
 * If possible create the dcache entry and derive our inode number and
 * file type from dcache entry.
 *
 * Since all of the proc inode numbers are dynamically generated, the inode
2020-12-16 07:42:32 +03:00
 * numbers do not exist until the inode is cached. This means creating
2006-10-02 13:18:57 +04:00
 * the dcache entry in readdir is necessary to keep the inode numbers
 * reported by readdir in sync with the inode numbers reported
 * by stat.
 */
2013-05-16 20:07:31 +04:00
bool proc_fill_cache(struct file *file, struct dir_context *ctx,
2018-06-08 03:10:10 +03:00
	const char *name, unsigned int len,
2007-05-08 11:26:15 +04:00
	instantiate_t instantiate, struct task_struct *task, const void *ptr)
2006-10-02 13:18:49 +04:00
{
2013-05-16 20:07:31 +04:00
	struct dentry *child, *dir = file->f_path.dentry;
2013-06-15 11:33:10 +04:00
	struct qstr qname = QSTR_INIT(name, len);
2006-10-02 13:18:49 +04:00
	struct inode *inode;
2018-05-03 16:21:05 +03:00
	unsigned type = DT_UNKNOWN;
	ino_t ino = 1;
2006-10-02 13:18:49 +04:00
2013-06-15 11:33:10 +04:00
	child = d_hash_and_lookup(dir, &qname);
2006-10-02 13:18:49 +04:00
	if (!child) {
2016-04-20 23:31:31 +03:00
		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
		child = d_alloc_parallel(dir, &qname, &wq);
		if (IS_ERR(child))
2013-06-15 11:33:10 +04:00
			goto end_instantiate;
2016-04-20 23:31:31 +03:00
		if (d_in_lookup(child)) {
2018-05-03 16:21:05 +03:00
			struct dentry *res;
			res = instantiate(child, task, ptr);
2016-04-20 23:31:31 +03:00
			d_lookup_done(child);
2018-05-03 16:21:05 +03:00
			if (unlikely(res)) {
				dput(child);
				child = res;
2018-06-08 08:17:11 +03:00
				if (IS_ERR(child))
					goto end_instantiate;
2016-04-20 23:31:31 +03:00
			}
2006-10-02 13:18:49 +04:00
		}
	}
2015-03-18 01:25:59 +03:00
	inode = d_inode(child);
2013-06-15 10:26:35 +04:00
	ino = inode->i_ino;
	type = inode->i_mode >> 12;
2006-10-02 13:18:49 +04:00
	dput(child);
2018-06-08 08:17:11 +03:00
end_instantiate:
2013-05-16 20:07:31 +04:00
	return dir_emit(ctx, name, len, ino, type);
2006-10-02 13:18:49 +04:00
}
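The `inode->i_mode >> 12` step in proc_fill_cache() works because the file-type bits occupy the top four bits of a 16-bit mode word, and the DT_* directory entry types use the same numbering. A minimal standalone sketch of that relationship follows. The SKETCH_ constants mirror the Linux ABI values and `sketch_mode_to_dtype()` is a hypothetical helper for illustration, not a kernel function.

```c
/* Linux ABI values, mirrored here so the sketch needs no system headers. */
#define SKETCH_S_IFMT  0170000u	/* file-type mask within a mode word */
#define SKETCH_S_IFDIR 0040000u
#define SKETCH_S_IFREG 0100000u
#define SKETCH_S_IFLNK 0120000u

#define SKETCH_DT_DIR  4u
#define SKETCH_DT_REG  8u
#define SKETCH_DT_LNK  10u

/*
 * Hypothetical helper: derive the readdir entry type from a mode word.
 * The mask is needed here because an unsigned int may carry extra high
 * bits; the kernel line above shifts a 16-bit i_mode directly.
 */
static unsigned int sketch_mode_to_dtype(unsigned int mode)
{
	return (mode & SKETCH_S_IFMT) >> 12;
}
```

The permission bits in the low 12 bits are discarded by the shift, so a symlink with mode 0120777 and one with mode 0120400 both report the same entry type.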
2012-01-11 03:11:23 +04:00
/*
 * dname_to_vma_addr - maps a dentry name into two unsigned longs
 * which represent vma start and end addresses.
 */
static int dname_to_vma_addr(struct dentry *dentry,
			     unsigned long *start, unsigned long *end)
{
2018-02-07 02:36:59 +03:00
	const char *str = dentry->d_name.name;
	unsigned long long sval, eval;
	unsigned int len;
2018-04-11 02:41:14 +03:00
	if (str[0] == '0' && str[1] != '-')
		return -EINVAL;
2018-02-07 02:36:59 +03:00
	len = _parse_integer(str, 16, &sval);
	if (len & KSTRTOX_OVERFLOW)
		return -EINVAL;
	if (sval != (unsigned long)sval)
		return -EINVAL;
	str += len;
	if (*str != '-')
2012-01-11 03:11:23 +04:00
		return -EINVAL;
2018-02-07 02:36:59 +03:00
	str++;
2018-04-11 02:41:14 +03:00
	if (str[0] == '0' && str[1])
		return -EINVAL;
2018-02-07 02:36:59 +03:00
	len = _parse_integer(str, 16, &eval);
	if (len & KSTRTOX_OVERFLOW)
		return -EINVAL;
	if (eval != (unsigned long)eval)
		return -EINVAL;
	str += len;
	if (*str != '\0')
		return -EINVAL;
	*start = sval;
	*end = eval;
2012-01-11 03:11:23 +04:00
	return 0;
}
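The validation order in dname_to_vma_addr() (reject a leading zero, parse hex, require a dash, repeat, require end of string) can be tried out in userspace with a sketch like the one below. All `sketch_` names are hypothetical and this is not the kernel's `_parse_integer()`. One deliberate difference, an assumption of this sketch: an empty digit run such as "-1000" is rejected outright.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: parse hex digits from s into *out.
 * Returns the number of characters consumed, or 0 if there were no
 * digits or the value does not fit in an unsigned long.
 */
static size_t sketch_parse_hex(const char *s, unsigned long *out)
{
	unsigned long val = 0;
	size_t n = 0;

	for (;; n++) {
		unsigned int d;
		char c = s[n];

		if (c >= '0' && c <= '9')
			d = (unsigned int)(c - '0');
		else if (c >= 'a' && c <= 'f')
			d = (unsigned int)(c - 'a') + 10;
		else if (c >= 'A' && c <= 'F')
			d = (unsigned int)(c - 'A') + 10;
		else
			break;
		if (val > (ULONG_MAX >> 4))
			return 0;	/* the next shift would overflow */
		val = (val << 4) | d;
	}
	*out = val;
	return n;
}

/* Returns 0 on success, -1 if name is not a canonical "start-end" pair. */
static int sketch_parse_vma_name(const char *name,
				 unsigned long *start, unsigned long *end)
{
	unsigned long s, e;
	size_t len;

	if (name[0] == '0' && name[1] != '-')
		return -1;	/* leading zero in the start address */
	len = sketch_parse_hex(name, &s);
	if (!len || name[len] != '-')
		return -1;
	name += len + 1;
	if (name[0] == '0' && name[1])
		return -1;	/* leading zero in the end address */
	len = sketch_parse_hex(name, &e);
	if (!len || name[len] != '\0')
		return -1;
	*start = s;
	*end = e;
	return 0;
}
```

Rejecting leading zeros means each address range has exactly one valid spelling, so two different names can never refer to the same mapping.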
2012-06-11 00:03:43 +04:00
static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
2012-01-11 03:11:23 +04:00
{
	unsigned long vm_start, vm_end;
	bool exact_vma_exists = false;
	struct mm_struct *mm = NULL;
	struct task_struct *task;
	struct inode *inode;
	int status = 0;
2012-06-11 00:03:43 +04:00
	if (flags & LOOKUP_RCU)
2012-01-11 03:11:23 +04:00
		return -ECHILD;
2015-03-18 01:25:59 +03:00
	inode = d_inode(dentry);
2012-01-11 03:11:23 +04:00
	task = get_proc_task(inode);
	if (!task)
		goto out_notask;
ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 02:00:04 +03:00
	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
2012-06-01 03:26:18 +04:00
	if (IS_ERR_OR_NULL(mm))
2012-01-11 03:11:23 +04:00
		goto out;
	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
2020-06-09 07:33:25 +03:00
		status = mmap_read_lock_killable(mm);
2019-07-12 07:00:03 +03:00
		if (!status) {
			exact_vma_exists = !!find_exact_vma(mm, vm_start,
							    vm_end);
2020-06-09 07:33:25 +03:00
			mmap_read_unlock(mm);
2019-07-12 07:00:03 +03:00
		}
2012-01-11 03:11:23 +04:00
	}
	mmput(mm);
	if (exact_vma_exists) {
2017-01-03 00:23:11 +03:00
		task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
2012-01-11 03:11:23 +04:00
		security_task_to_inode(task, inode);
		status = 1;
	}
out:
	put_task_struct(task);
out_notask:
	return status;
}
static const struct dentry_operations tid_map_files_dentry_operations = {
	.d_revalidate	= map_files_d_revalidate,
	.d_delete	= pid_delete_dentry,
};
2015-11-17 18:20:54 +03:00
static int map_files_get_link(struct dentry *dentry, struct path *path)
2012-01-11 03:11:23 +04:00
{
	unsigned long vm_start, vm_end;
	struct vm_area_struct *vma;
	struct task_struct *task;
	struct mm_struct *mm;
	int rc;
	rc = -ENOENT;
2015-03-18 01:25:59 +03:00
	task = get_proc_task(d_inode(dentry));
2012-01-11 03:11:23 +04:00
	if (!task)
		goto out;
	mm = get_task_mm(task);
	put_task_struct(task);
	if (!mm)
		goto out;
	rc = dname_to_vma_addr(dentry, &vm_start, &vm_end);
	if (rc)
		goto out_mmput;
2020-06-09 07:33:25 +03:00
	rc = mmap_read_lock_killable(mm);
2019-07-12 07:00:03 +03:00
	if (rc)
		goto out_mmput;
2014-03-11 02:49:45 +04:00
	rc = -ENOENT;
2012-01-11 03:11:23 +04:00
	vma = find_exact_vma(mm, vm_start, vm_end);
	if (vma && vma->vm_file) {
		*path = vma->vm_file->f_path;
		path_get(path);
		rc = 0;
	}
2020-06-09 07:33:25 +03:00
	mmap_read_unlock(mm);
2012-01-11 03:11:23 +04:00
out_mmput:
	mmput(mm);
out:
	return rc;
}
struct map_files_info {
2018-02-07 02:37:06 +03:00
	unsigned long	start;
	unsigned long	end;
2012-08-27 22:55:26 +04:00
	fmode_t		mode;
2012-01-11 03:11:23 +04:00
};
procfs: always expose /proc/<pid>/map_files/ and make it readable
Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.
Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped. The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
to the backing file even if that backing file has been unlinked.
Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
functionality "hole".
Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible. You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.
This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:
* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()
Remove the CAP_SYS_ADMIN restriction, leaving only the current
restriction requiring PTRACE_MODE_READ. The information made
available to userspace by these three functions is already
available in /proc/PID/maps with MODE_READ, so I don't see any
reason to limit them any further (see below for more detail).
* proc_map_files_follow_link()
This stub has been added, and requires that the user have
CAP_SYS_ADMIN in order to follow the links in map_files/,
since there was concern on LKML both about the potential for
bypassing permissions on ancestor directories in the path to
files pointed to, and about what happens with more exotic
memory mappings created by some drivers (i.e. dma-buf).
In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.
[akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 01:35:54 +03:00
/*
2020-07-19 13:04:14 +03:00
 * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
 * to concerns about how the symlinks may be used to bypass permissions on
 * ancestor directories in the path to the file in question.
2015-09-10 01:35:54 +03:00
*/
static const char *
2015-11-17 18:20:54 +03:00
proc_map_files_get_link(struct dentry *dentry,
2015-12-29 23:58:39 +03:00
			struct inode *inode,
			struct delayed_call *done)
2015-09-10 01:35:54 +03:00
{
2020-07-19 13:04:14 +03:00
	if (!checkpoint_restore_ns_capable(&init_user_ns))
2015-09-10 01:35:54 +03:00
		return ERR_PTR(-EPERM);
2015-12-29 23:58:39 +03:00
	return proc_pid_get_link(dentry, inode, done);
2015-09-10 01:35:54 +03:00
}
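For userspace code that wants to reach one of these symlinks, the path can be assembled from the pid and the mapping's address range. The sketch below assumes the entry names are lowercase hex without leading zeros in a "%lx-%lx" layout, which is consistent with what dname_to_vma_addr() accepts. `sketch_map_files_path()` is a hypothetical helper name, not an existing API.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the map_files path for one mapping.
 * Returns 0 on success, -1 if buf is too small or formatting failed.
 */
static int sketch_map_files_path(char *buf, size_t size, int pid,
				 unsigned long start, unsigned long end)
{
	int n = snprintf(buf, size, "/proc/%d/map_files/%lx-%lx",
			 pid, start, end);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Per the comment above proc_map_files_get_link(), following such a link requires CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE, so a caller should expect EPERM when opening the resulting path without one of them.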
/*
2015-11-17 18:20:54 +03:00
 * Identical to proc_pid_link_inode_operations except for get_link()
2015-09-10 01:35:54 +03:00
*/
static const struct inode_operations proc_map_files_link_inode_operations = {
	.readlink	= proc_pid_readlink,
2015-11-17 18:20:54 +03:00
	.get_link	= proc_map_files_get_link,
2015-09-10 01:35:54 +03:00
	.setattr	= proc_setattr,
};
2018-05-03 16:21:05 +03:00
static struct dentry *
proc_map_files_instantiate(struct dentry *dentry,
2012-01-11 03:11:23 +04:00
			   struct task_struct *task, const void *ptr)
{
2012-08-27 22:55:26 +04:00
	fmode_t mode = (fmode_t)(unsigned long)ptr;
2012-01-11 03:11:23 +04:00
	struct proc_inode *ei;
	struct inode *inode;
2018-05-03 16:21:05 +03:00
	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK |
2016-11-11 00:18:28 +03:00
				    ((mode & FMODE_READ ) ? S_IRUSR : 0) |
				    ((mode & FMODE_WRITE) ? S_IWUSR : 0));
2012-01-11 03:11:23 +04:00
	if (!inode)
2018-05-03 16:21:05 +03:00
		return ERR_PTR(-ENOENT);
2012-01-11 03:11:23 +04:00
	ei = PROC_I(inode);
2015-11-17 18:20:54 +03:00
	ei->op.proc_get_link = map_files_get_link;
2012-01-11 03:11:23 +04:00
2015-09-10 01:35:54 +03:00
inode - > i_op = & proc_map_files_link_inode_operations ;
2012-01-11 03:11:23 +04:00
inode - > i_size = 64 ;
d_set_d_op ( dentry , & tid_map_files_dentry_operations ) ;
2018-05-03 16:21:05 +03:00
return d_splice_alias ( inode , dentry ) ;
2012-01-11 03:11:23 +04:00
}
static struct dentry * proc_map_files_lookup ( struct inode * dir ,
2012-06-11 01:13:09 +04:00
struct dentry * dentry , unsigned int flags )
2012-01-11 03:11:23 +04:00
{
unsigned long vm_start , vm_end ;
struct vm_area_struct * vma ;
struct task_struct * task ;
2018-05-03 16:21:05 +03:00
struct dentry * result ;
2012-01-11 03:11:23 +04:00
struct mm_struct * mm ;
2018-05-03 16:21:05 +03:00
result = ERR_PTR ( - ENOENT ) ;
2012-01-11 03:11:23 +04:00
task = get_proc_task ( dir ) ;
if ( ! task )
goto out ;
2018-05-03 16:21:05 +03:00
result = ERR_PTR ( - EACCES ) ;
ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 02:00:04 +03:00
if ( ! ptrace_may_access ( task , PTRACE_MODE_READ_FSCREDS ) )
2012-01-11 03:11:23 +04:00
goto out_put_task ;
2018-05-03 16:21:05 +03:00
result = ERR_PTR ( - ENOENT ) ;
2012-01-11 03:11:23 +04:00
if ( dname_to_vma_addr ( dentry , & vm_start , & vm_end ) )
2012-05-18 04:03:25 +04:00
goto out_put_task ;
2012-01-11 03:11:23 +04:00
mm = get_task_mm ( task ) ;
if ( ! mm )
2012-05-18 04:03:25 +04:00
goto out_put_task ;
2012-01-11 03:11:23 +04:00
2019-07-12 07:00:03 +03:00
result = ERR_PTR ( - EINTR ) ;
2020-06-09 07:33:25 +03:00
if ( mmap_read_lock_killable ( mm ) )
2019-07-12 07:00:03 +03:00
goto out_put_mm ;
result = ERR_PTR ( - ENOENT ) ;
2012-01-11 03:11:23 +04:00
vma = find_exact_vma ( mm , vm_start , vm_end ) ;
if ( ! vma )
goto out_no_vma ;
2012-11-27 04:29:42 +04:00
if ( vma - > vm_file )
2018-05-03 16:21:05 +03:00
result = proc_map_files_instantiate ( dentry , task ,
2012-11-27 04:29:42 +04:00
( void * ) ( unsigned long ) vma - > vm_file - > f_mode ) ;
2012-01-11 03:11:23 +04:00
out_no_vma :
2020-06-09 07:33:25 +03:00
mmap_read_unlock ( mm ) ;
2019-07-12 07:00:03 +03:00
out_put_mm :
2012-01-11 03:11:23 +04:00
mmput ( mm ) ;
out_put_task :
put_task_struct ( task ) ;
out :
2018-05-03 16:21:05 +03:00
return result ;
2012-01-11 03:11:23 +04:00
}
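The lookup above turns a directory entry name of the form `<start>-<end>` (two hexadecimal addresses, matching the `%lx-%lx` names emitted by proc_map_files_readdir()) into a VMA range through dname_to_vma_addr(), whose body is not part of this section. As a rough userspace sketch of that parsing step: `parse_hex()` and `parse_map_name()` are hypothetical helpers written for illustration, and the kernel helper may apply different or stricter rules.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical helper: consume one run of hex digits, advancing *s. */
static int parse_hex(const char **s, unsigned long *out)
{
	unsigned long val = 0;
	int digits = 0;

	while (isxdigit((unsigned char)**s)) {
		int c = tolower((unsigned char)**s);
		unsigned long d = (c >= 'a') ? (unsigned long)(c - 'a' + 10)
					     : (unsigned long)(c - '0');

		if (val > (ULONG_MAX >> 4))
			return -EINVAL;	/* one more digit would overflow */
		val = (val << 4) | d;
		digits++;
		(*s)++;
	}
	if (!digits)
		return -EINVAL;
	*out = val;
	return 0;
}

/*
 * Hypothetical helper: split "start-end" into two addresses.
 * Returns 0 on success and -EINVAL on malformed input.
 */
static int parse_map_name(const char *name, unsigned long *start,
			  unsigned long *end)
{
	if (parse_hex(&name, start))
		return -EINVAL;
	if (*name++ != '-')
		return -EINVAL;
	if (parse_hex(&name, end))
		return -EINVAL;
	return *name == '\0' ? 0 : -EINVAL;
}
```

A name that fails to parse corresponds to the -ENOENT path in the lookup: the entry simply does not exist.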
static const struct inode_operations proc_map_files_inode_operations = {
	.lookup		= proc_map_files_lookup,
	.permission	= proc_fd_permission,
	.setattr	= proc_setattr,
};
static int
2013-05-16 20:07:31 +04:00
proc_map_files_readdir ( struct file * file , struct dir_context * ctx )
2012-01-11 03:11:23 +04:00
{
struct vm_area_struct * vma ;
struct task_struct * task ;
struct mm_struct * mm ;
2013-05-16 20:07:31 +04:00
unsigned long nr_files , pos , i ;
2019-03-12 09:31:18 +03:00
GENRADIX ( struct map_files_info ) fa ;
2013-05-16 20:07:31 +04:00
struct map_files_info * p ;
2012-01-11 03:11:23 +04:00
int ret ;
2022-09-06 22:48:56 +03:00
struct vma_iterator vmi ;
2012-01-11 03:11:23 +04:00
2019-03-12 09:31:18 +03:00
genradix_init ( & fa ) ;
2012-01-11 03:11:23 +04:00
ret = - ENOENT ;
2013-05-16 20:07:31 +04:00
task = get_proc_task ( file_inode ( file ) ) ;
2012-01-11 03:11:23 +04:00
if ( ! task )
goto out ;
ret = - EACCES ;
2016-01-21 02:00:04 +03:00
if ( ! ptrace_may_access ( task , PTRACE_MODE_READ_FSCREDS ) )
2012-01-11 03:11:23 +04:00
goto out_put_task ;
ret = 0 ;
2013-05-16 20:07:31 +04:00
if ( ! dir_emit_dots ( file , ctx ) )
goto out_put_task ;
2012-01-11 03:11:23 +04:00
2013-05-16 20:07:31 +04:00
mm = get_task_mm ( task ) ;
if ( ! mm )
goto out_put_task ;
2019-07-12 07:00:03 +03:00
2020-06-09 07:33:29 +03:00
ret = mmap_read_lock_killable ( mm ) ;
2019-07-12 07:00:03 +03:00
if ( ret ) {
mmput ( mm ) ;
goto out_put_task ;
}
2012-01-11 03:11:23 +04:00
2013-05-16 20:07:31 +04:00
nr_files = 0 ;
2012-01-11 03:11:23 +04:00
2013-05-16 20:07:31 +04:00
/*
* We need two passes here :
*
2020-06-09 07:33:54 +03:00
* 1 ) Collect vmas of mapped files with mmap_lock taken
* 2 ) Release mmap_lock and instantiate entries
2013-05-16 20:07:31 +04:00
*
* otherwise lockdep complains , since the filldir ( )
2020-06-09 07:33:54 +03:00
* routine might require mmap_lock taken in might_fault ( ) .
2013-05-16 20:07:31 +04:00
*/
2012-01-11 03:11:23 +04:00
2022-09-06 22:48:56 +03:00
pos = 2 ;
vma_iter_init ( & vmi , mm , 0 ) ;
for_each_vma ( vmi , vma ) {
2019-03-12 09:31:18 +03:00
if ( ! vma - > vm_file )
continue ;
if ( + + pos < = ctx - > pos )
continue ;
2013-05-16 20:07:31 +04:00
2019-03-12 09:31:18 +03:00
p = genradix_ptr_alloc ( & fa , nr_files + + , GFP_KERNEL ) ;
if ( ! p ) {
2013-05-16 20:07:31 +04:00
ret = - ENOMEM ;
2020-06-09 07:33:29 +03:00
mmap_read_unlock ( mm ) ;
2013-05-16 20:07:31 +04:00
mmput ( mm ) ;
goto out_put_task ;
2012-01-11 03:11:23 +04:00
}
2013-05-16 20:07:31 +04:00
2019-03-12 09:31:18 +03:00
p - > start = vma - > vm_start ;
p - > end = vma - > vm_end ;
p - > mode = vma - > vm_file - > f_mode ;
2012-01-11 03:11:23 +04:00
}
2020-06-09 07:33:29 +03:00
mmap_read_unlock ( mm ) ;
2018-04-11 02:32:05 +03:00
mmput ( mm ) ;
2013-05-16 20:07:31 +04:00
	for (i = 0; i < nr_files; i++) {
2018-02-07 02:37:06 +03:00
		char buf[4 * sizeof(long) + 2];	/* max: %lx-%lx\0 */
		unsigned int len;
2019-03-12 09:31:18 +03:00
		p = genradix_ptr(&fa, i);
2018-02-07 02:37:06 +03:00
		len = snprintf(buf, sizeof(buf), "%lx-%lx", p->start, p->end);
2013-05-16 20:07:31 +04:00
		if (!proc_fill_cache(file, ctx,
2018-02-07 02:37:06 +03:00
				      buf, len,
2013-05-16 20:07:31 +04:00
				      proc_map_files_instantiate,
				      task,
				      (void *)(unsigned long)p->mode))
			break;
		ctx->pos++;
2012-01-11 03:11:23 +04:00
	}
out_put_task :
put_task_struct ( task ) ;
out :
2019-03-12 09:31:18 +03:00
genradix_free ( & fa ) ;
2012-01-11 03:11:23 +04:00
return ret ;
}
static const struct file_operations proc_map_files_operations = {
	.read		= generic_read_dir,
2016-04-21 00:13:54 +03:00
	.iterate_shared	= proc_map_files_readdir,
	.llseek		= generic_file_llseek,
2012-01-11 03:11:23 +04:00
};
2017-01-21 08:09:08 +03:00
# if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_POSIX_TIMERS)
2013-03-11 13:12:45 +04:00
struct timers_private {
	struct pid *pid;
	struct task_struct *task;
	struct sighand_struct *sighand;
2013-03-11 13:13:08 +04:00
	struct pid_namespace *ns;
2013-03-11 13:12:45 +04:00
	unsigned long flags;
};

static void *timers_start(struct seq_file *m, loff_t *pos)
{
	struct timers_private *tp = m->private;

	tp->task = get_pid_task(tp->pid, PIDTYPE_PID);
	if (!tp->task)
		return ERR_PTR(-ESRCH);

	tp->sighand = lock_task_sighand(tp->task, &tp->flags);
	if (!tp->sighand)
		return ERR_PTR(-ESRCH);

	return seq_list_start(&tp->task->signal->posix_timers, *pos);
}

static void *timers_next(struct seq_file *m, void *v, loff_t *pos)
{
	struct timers_private *tp = m->private;

	return seq_list_next(v, &tp->task->signal->posix_timers, pos);
}

static void timers_stop(struct seq_file *m, void *v)
{
	struct timers_private *tp = m->private;

	if (tp->sighand) {
		unlock_task_sighand(tp->task, &tp->flags);
		tp->sighand = NULL;
	}

	if (tp->task) {
		put_task_struct(tp->task);
		tp->task = NULL;
	}
}
static int show_timer(struct seq_file *m, void *v)
{
	struct k_itimer *timer;
2013-03-11 13:13:08 +04:00
	struct timers_private *tp = m->private;
	int notify;
2014-08-09 01:21:33 +04:00
	static const char * const nstr[] = {
2013-03-11 13:13:08 +04:00
		[SIGEV_SIGNAL] = "signal",
		[SIGEV_NONE] = "none",
		[SIGEV_THREAD] = "thread",
	};
2013-03-11 13:12:45 +04:00
	timer = list_entry((struct list_head *)v, struct k_itimer, list);
2013-03-11 13:13:08 +04:00
	notify = timer->it_sigev_notify;
2013-03-11 13:12:45 +04:00
	seq_printf(m, "ID: %d\n", timer->it_id);
2017-12-07 05:23:27 +03:00
	seq_printf(m, "signal: %d/%px\n",
2015-04-16 02:18:17 +03:00
		   timer->sigq->info.si_signo,
		   timer->sigq->info.si_value.sival_ptr);
2013-03-11 13:13:08 +04:00
	seq_printf(m, "notify: %s/%s.%d\n",
2015-04-16 02:18:17 +03:00
		   nstr[notify & ~SIGEV_THREAD_ID],
		   (notify & SIGEV_THREAD_ID) ? "tid" : "pid",
		   pid_nr_ns(timer->it_pid, tp->ns));
2013-05-17 02:12:03 +04:00
	seq_printf(m, "ClockID: %d\n", timer->it_clock);
2013-03-11 13:12:45 +04:00
	return 0;
}
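The "notify:" line in show_timer() packs two pieces of information into one integer: the low bits select the notification method, and the SIGEV_THREAD_ID bit says whether the target is a thread or a process. The decoding can be sketched in userspace as below. The `EX_SIGEV_*` names are local stand-ins whose values are assumed to mirror the Linux SIGEV_* constants, and `format_notify()` is a hypothetical helper that expects a valid `notify` value.

```c
#include <stddef.h>
#include <stdio.h>

/* Local stand-ins, assumed to match the Linux SIGEV_* values. */
enum {
	EX_SIGEV_SIGNAL    = 0,
	EX_SIGEV_NONE      = 1,
	EX_SIGEV_THREAD    = 2,
	EX_SIGEV_THREAD_ID = 4,
};

/* Hypothetical helper: render the "notify:" field for one timer. */
static int format_notify(char *buf, size_t size, int notify, int id)
{
	static const char *const nstr[] = {
		[EX_SIGEV_SIGNAL] = "signal",
		[EX_SIGEV_NONE]   = "none",
		[EX_SIGEV_THREAD] = "thread",
	};

	/* Mask off the thread-id flag to index the method name table. */
	return snprintf(buf, size, "notify: %s/%s.%d",
			nstr[notify & ~EX_SIGEV_THREAD_ID],
			(notify & EX_SIGEV_THREAD_ID) ? "tid" : "pid", id);
}
```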
static const struct seq_operations proc_timers_seq_ops = {
	.start	= timers_start,
	.next	= timers_next,
	.stop	= timers_stop,
	.show	= show_timer,
};

static int proc_timers_open(struct inode *inode, struct file *file)
{
	struct timers_private *tp;

	tp = __seq_open_private(file, &proc_timers_seq_ops,
			sizeof(struct timers_private));
	if (!tp)
		return -ENOMEM;

	tp->pid = proc_pid(inode);
2020-05-18 21:07:38 +03:00
	tp->ns = proc_pid_ns(inode->i_sb);
2013-03-11 13:12:45 +04:00
	return 0;
}

static const struct file_operations proc_timers_operations = {
	.open		= proc_timers_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= seq_release_private,
};
2016-03-18 00:20:57 +03:00
# endif
2012-01-11 03:11:23 +04:00
2016-03-18 00:20:54 +03:00
static ssize_t timerslack_ns_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * offset )
{
struct inode * inode = file_inode ( file ) ;
struct task_struct * p ;
u64 slack_ns ;
int err ;
err = kstrtoull_from_user ( buf , count , 10 , & slack_ns ) ;
if ( err < 0 )
return err ;
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
2016-10-08 03:02:33 +03:00
if ( p ! = current ) {
2019-01-04 02:25:56 +03:00
rcu_read_lock ( ) ;
if ( ! ns_capable ( __task_cred ( p ) - > user_ns , CAP_SYS_NICE ) ) {
rcu_read_unlock ( ) ;
2016-10-08 03:02:33 +03:00
count = - EPERM ;
goto out ;
}
2019-01-04 02:25:56 +03:00
rcu_read_unlock ( ) ;
2016-03-18 00:20:54 +03:00
2016-10-08 03:02:33 +03:00
err = security_task_setscheduler ( p ) ;
if ( err ) {
count = err ;
goto out ;
}
2016-10-08 03:02:29 +03:00
}
proc: relax /proc/<tid>/timerslack_ns capability requirements
When an interface to allow a task to change another task's timerslack was
first proposed, it was suggested that something greater than
CAP_SYS_NICE would be needed, as a task could be delayed further than
what normally could be done with nice adjustments.
So CAP_SYS_PTRACE was adopted instead for what became the
/proc/<tid>/timerslack_ns interface. However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks' memory. This is considered too high
a privilege level for only needing to change the timerslack.
After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.
So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc/<tid>/timerslack_ns seems sufficient.
Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.
This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.
Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Nick Kralevich <nnk@google.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 03:02:26 +03:00
task_lock ( p ) ;
if ( slack_ns = = 0 )
p - > timer_slack_ns = p - > default_timer_slack_ns ;
else
p - > timer_slack_ns = slack_ns ;
task_unlock ( p ) ;
out :
2016-03-18 00:20:54 +03:00
put_task_struct ( p ) ;
return count ;
}
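The value semantics of timerslack_ns_write() are simple: the input is parsed as a base-10 unsigned 64-bit number, and a value of 0 restores the task's default slack instead of setting the slack to zero. A userspace model of just that rule is sketched below. `struct slack_state` and `slack_write()` are hypothetical, the permission checks and locking are deliberately left out, and the parsing only approximates what kstrtoull_from_user() accepts.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two task_struct fields involved. */
struct slack_state {
	uint64_t timer_slack_ns;
	uint64_t default_timer_slack_ns;
};

/*
 * Hypothetical model of the write path: parse a decimal string
 * (an optional trailing newline is tolerated) and apply it.
 * Writing 0 resets the slack to the default value.
 */
static int slack_write(struct slack_state *t, const char *text)
{
	unsigned long long v;
	char *end;

	if (text[0] < '0' || text[0] > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (errno)
		return -ERANGE;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	t->timer_slack_ns = v ? v : t->default_timer_slack_ns;
	return 0;
}
```

Rejected input leaves the state untouched, which matches the kernel function returning early before it ever takes the task lock.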
static int timerslack_ns_show ( struct seq_file * m , void * v )
{
struct inode * inode = m - > private ;
struct task_struct * p ;
2016-10-08 03:02:26 +03:00
int err = 0 ;
2016-03-18 00:20:54 +03:00
p = get_proc_task ( inode ) ;
if ( ! p )
return - ESRCH ;
2016-10-08 03:02:33 +03:00
if ( p ! = current ) {
2019-01-04 02:25:56 +03:00
rcu_read_lock ( ) ;
if ( ! ns_capable ( __task_cred ( p ) - > user_ns , CAP_SYS_NICE ) ) {
rcu_read_unlock ( ) ;
2016-10-08 03:02:33 +03:00
err = - EPERM ;
goto out ;
}
2019-01-04 02:25:56 +03:00
rcu_read_unlock ( ) ;
2016-10-08 03:02:33 +03:00
err = security_task_getscheduler ( p ) ;
if ( err )
goto out ;
}
2016-10-08 03:02:29 +03:00
2016-10-08 03:02:26 +03:00
task_lock ( p ) ;
seq_printf ( m , " %llu \n " , p - > timer_slack_ns ) ;
task_unlock ( p ) ;
out :
2016-03-18 00:20:54 +03:00
put_task_struct ( p ) ;
return err ;
}
static int timerslack_ns_open(struct inode *inode, struct file *filp)
{
	return single_open(filp, timerslack_ns_show, inode);
}

static const struct file_operations proc_pid_set_timerslack_ns_operations = {
	.open		= timerslack_ns_open,
	.read		= seq_read,
	.write		= timerslack_ns_write,
	.llseek		= seq_lseek,
	.release	= single_release,
};
2018-05-03 16:21:05 +03:00
static struct dentry * proc_pident_instantiate ( struct dentry * dentry ,
struct task_struct * task , const void * ptr )
2006-10-02 13:18:49 +04:00
{
2007-05-08 11:26:15 +04:00
const struct pid_entry * p = ptr ;
2006-10-02 13:18:49 +04:00
struct inode * inode ;
struct proc_inode * ei ;
2018-05-03 16:21:05 +03:00
inode = proc_pid_make_inode ( dentry - > d_sb , task , p - > mode ) ;
2006-10-02 13:18:49 +04:00
if ( ! inode )
2018-05-03 16:21:05 +03:00
return ERR_PTR ( - ENOENT ) ;
2006-10-02 13:18:49 +04:00
ei = PROC_I ( inode ) ;
if ( S_ISDIR ( inode - > i_mode ) )
2011-10-28 16:13:29 +04:00
set_nlink ( inode , 2 ) ; /* Use getattr to fix if necessary */
2006-10-02 13:18:49 +04:00
if ( p - > iop )
inode - > i_op = p - > iop ;
if ( p - > fop )
inode - > i_fop = p - > fop ;
ei - > op = p - > op ;
2018-05-03 04:26:16 +03:00
pid_update_inode ( task , inode ) ;
2011-01-07 09:49:55 +03:00
d_set_d_op ( dentry , & pid_dentry_operations ) ;
2018-05-03 16:21:05 +03:00
return d_splice_alias ( inode , dentry ) ;
2006-10-02 13:18:49 +04:00
}
2005-04-17 02:20:36 +04:00
static struct dentry * proc_pident_lookup ( struct inode * dir ,
struct dentry * dentry ,
2019-03-12 09:28:51 +03:00
const struct pid_entry * p ,
const struct pid_entry * end )
2005-04-17 02:20:36 +04:00
{
2006-06-26 11:25:55 +04:00
struct task_struct * task = get_proc_task ( dir ) ;
2018-05-03 16:21:05 +03:00
struct dentry * res = ERR_PTR ( - ENOENT ) ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:25:55 +04:00
if ( ! task )
goto out_no_task ;
2005-04-17 02:20:36 +04:00
2006-10-02 13:17:07 +04:00
/*
* Yes , it does not scale . And it should not . Don ' t add
* new entries into / proc / < tgid > / without very good reasons .
*/
2019-03-12 09:28:51 +03:00
for ( ; p < end ; p + + ) {
2005-04-17 02:20:36 +04:00
if ( p - > len ! = dentry - > d_name . len )
continue ;
2018-06-15 01:27:17 +03:00
if ( ! memcmp ( dentry - > d_name . name , p - > name , p - > len ) ) {
res = proc_pident_instantiate ( dentry , task , p ) ;
2005-04-17 02:20:36 +04:00
break ;
2018-06-15 01:27:17 +03:00
}
2005-04-17 02:20:36 +04:00
}
2006-06-26 11:25:55 +04:00
put_task_struct ( task ) ;
out_no_task :
2018-05-03 16:21:05 +03:00
return res ;
2005-04-17 02:20:36 +04:00
}
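proc_pident_lookup() is a plain linear scan over a `[p, end)` range of table entries: it compares the stored length first and only calls memcmp() when the lengths match. The same pattern in a self-contained userspace form might look like the following; `struct entry`, `ENTRY()`, `demo_entries` and `entry_lookup()` are illustrative names, not kernel definitions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, pared-down analogue of struct pid_entry. */
struct entry {
	const char *name;
	size_t len;
	int mode;
};

/* Store the name length at compile time so lookups never call strlen(). */
#define ENTRY(n, m) { (n), sizeof(n) - 1, (m) }

static const struct entry demo_entries[] = {
	ENTRY("current", 0666),
	ENTRY("prev", 0444),
	ENTRY("exec", 0666),
};

/* Linear scan over [p, end): cheap length check first, then memcmp(). */
static const struct entry *entry_lookup(const struct entry *p,
					const struct entry *end,
					const char *name, size_t len)
{
	for (; p < end; p++) {
		if (p->len != len)
			continue;
		if (!memcmp(name, p->name, len))
			return p;
	}
	return NULL;
}
```

The scan is O(n) in the table size, which is the point of the comment in the kernel source: the table is expected to stay small.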
2013-05-16 20:07:31 +04:00
static int proc_pident_readdir ( struct file * file , struct dir_context * ctx ,
2007-05-08 11:26:15 +04:00
const struct pid_entry * ents , unsigned int nents )
2006-10-02 13:17:05 +04:00
{
2013-05-16 20:07:31 +04:00
struct task_struct * task = get_proc_task ( file_inode ( file ) ) ;
const struct pid_entry * p ;
2006-10-02 13:17:05 +04:00
if ( ! task )
2013-05-16 20:07:31 +04:00
return - ENOENT ;
2006-10-02 13:17:05 +04:00
2013-05-16 20:07:31 +04:00
if ( ! dir_emit_dots ( file , ctx ) )
goto out ;
if ( ctx - > pos > = nents + 2 )
goto out ;
2006-10-02 13:17:05 +04:00
2016-12-13 03:45:28 +03:00
for ( p = ents + ( ctx - > pos - 2 ) ; p < ents + nents ; p + + ) {
2013-05-16 20:07:31 +04:00
if ( ! proc_fill_cache ( file , ctx , p - > name , p - > len ,
proc_pident_instantiate , task , p ) )
break ;
ctx - > pos + + ;
}
2006-10-02 13:17:05 +04:00
out :
2006-10-02 13:18:49 +04:00
put_task_struct ( task ) ;
2013-05-16 20:07:31 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2006-10-02 13:17:05 +04:00
# ifdef CONFIG_SECURITY
2021-06-08 20:12:21 +03:00
static int proc_pid_attr_open ( struct inode * inode , struct file * file )
{
2021-06-15 19:26:19 +03:00
file - > private_data = NULL ;
__mem_open ( inode , file , PTRACE_MODE_READ_FSCREDS ) ;
return 0 ;
2021-06-08 20:12:21 +03:00
}
2006-10-02 13:17:05 +04:00
static ssize_t proc_pid_attr_read ( struct file * file , char __user * buf ,
size_t count , loff_t * ppos )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2007-03-12 19:17:58 +03:00
char * p = NULL ;
2006-10-02 13:17:05 +04:00
ssize_t length ;
struct task_struct * task = get_proc_task ( inode ) ;
if ( ! task )
2007-03-12 19:17:58 +03:00
return - ESRCH ;
2006-10-02 13:17:05 +04:00
2018-09-22 03:16:59 +03:00
length = security_getprocattr ( task , PROC_I ( inode ) - > op . lsm ,
2022-01-31 03:57:52 +03:00
file - > f_path . dentry - > d_name . name ,
2007-03-12 19:17:58 +03:00
& p ) ;
2006-10-02 13:17:05 +04:00
put_task_struct ( task ) ;
2007-03-12 19:17:58 +03:00
if ( length > 0 )
length = simple_read_from_buffer ( buf , count , ppos , p , length ) ;
kfree ( p ) ;
2006-10-02 13:17:05 +04:00
return length ;
2005-04-17 02:20:36 +04:00
}
2006-10-02 13:17:05 +04:00
static ssize_t proc_pid_attr_write ( struct file * file , const char __user * buf ,
size_t count , loff_t * ppos )
{
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2018-08-22 07:54:30 +03:00
struct task_struct * task ;
2015-12-24 08:16:30 +03:00
void * page ;
2018-08-22 07:54:30 +03:00
int rv ;
2017-01-09 18:07:31 +03:00
2021-05-25 22:37:35 +03:00
/* A task may only write when it was the opener. */
2021-06-08 20:12:21 +03:00
if ( file - > private_data ! = current - > mm )
2021-05-25 22:37:35 +03:00
return - EPERM ;
2018-08-22 07:54:30 +03:00
rcu_read_lock ( ) ;
task = pid_task ( proc_pid ( inode ) , PIDTYPE_PID ) ;
if ( ! task ) {
rcu_read_unlock ( ) ;
return - ESRCH ;
}
2017-01-09 18:07:31 +03:00
/* A task may only write its own attributes. */
2018-08-22 07:54:30 +03:00
if ( current ! = task ) {
rcu_read_unlock ( ) ;
return - EACCES ;
}
2019-04-19 21:55:12 +03:00
/* Prevent changes to overridden credentials. */
if ( current_cred ( ) ! = current_real_cred ( ) ) {
rcu_read_unlock ( ) ;
return - EBUSY ;
}
2018-08-22 07:54:30 +03:00
rcu_read_unlock ( ) ;
2017-01-09 18:07:31 +03:00
2006-10-02 13:17:05 +04:00
if ( count > PAGE_SIZE )
count = PAGE_SIZE ;
/* No partial writes. */
if ( * ppos ! = 0 )
2018-08-22 07:54:30 +03:00
return - EINVAL ;
2006-10-02 13:17:05 +04:00
2015-12-24 08:16:30 +03:00
page = memdup_user ( buf , count ) ;
if ( IS_ERR ( page ) ) {
2018-08-22 07:54:30 +03:00
rv = PTR_ERR ( page ) ;
2006-10-02 13:17:05 +04:00
goto out ;
2015-12-24 08:16:30 +03:00
}
2006-10-02 13:17:05 +04:00
2009-05-08 16:55:27 +04:00
/* Guard against adverse ptrace interaction */
2018-08-22 07:54:30 +03:00
rv = mutex_lock_interruptible ( & current - > signal - > cred_guard_mutex ) ;
if ( rv < 0 )
2009-05-08 16:55:27 +04:00
goto out_free ;
2018-09-22 03:16:59 +03:00
rv = security_setprocattr ( PROC_I ( inode ) - > op . lsm ,
file - > f_path . dentry - > d_name . name , page ,
count ) ;
2017-01-09 18:07:31 +03:00
mutex_unlock ( & current - > signal - > cred_guard_mutex ) ;
2006-10-02 13:17:05 +04:00
out_free :
2015-12-24 08:16:30 +03:00
kfree ( page ) ;
2006-10-02 13:17:05 +04:00
out :
2018-08-22 07:54:30 +03:00
return rv ;
2006-10-02 13:17:05 +04:00
}
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_pid_attr_operations = {
2021-06-08 20:12:21 +03:00
. open = proc_pid_attr_open ,
2006-10-02 13:17:05 +04:00
. read = proc_pid_attr_read ,
. write = proc_pid_attr_write ,
2010-03-18 01:06:02 +03:00
. llseek = generic_file_llseek ,
2021-06-08 20:12:21 +03:00
. release = mem_release ,
2006-10-02 13:17:05 +04:00
} ;
2018-09-22 03:16:59 +03:00
#define LSM_DIR_OPS(LSM) \
static int proc_##LSM##_attr_dir_iterate(struct file *filp, \
			     struct dir_context *ctx) \
{ \
	return proc_pident_readdir(filp, ctx, \
				   LSM##_attr_dir_stuff, \
				   ARRAY_SIZE(LSM##_attr_dir_stuff)); \
} \
\
static const struct file_operations proc_##LSM##_attr_dir_ops = { \
	.read		= generic_read_dir, \
	.iterate	= proc_##LSM##_attr_dir_iterate, \
	.llseek		= default_llseek, \
}; \
\
static struct dentry *proc_##LSM##_attr_dir_lookup(struct inode *dir, \
				struct dentry *dentry, unsigned int flags) \
{ \
	return proc_pident_lookup(dir, dentry, \
				  LSM##_attr_dir_stuff, \
2019-03-12 09:28:51 +03:00
			LSM##_attr_dir_stuff + ARRAY_SIZE(LSM##_attr_dir_stuff)); \
2018-09-22 03:16:59 +03:00
} \
\
static const struct inode_operations proc_##LSM##_attr_dir_inode_ops = { \
	.lookup		= proc_##LSM##_attr_dir_lookup, \
	.getattr	= pid_getattr, \
	.setattr	= proc_setattr, \
}
#ifdef CONFIG_SECURITY_SMACK
static const struct pid_entry smack_attr_dir_stuff[] = {
	ATTR("smack", "current",	0666),
};
LSM_DIR_OPS(smack);
#endif
2019-02-04 16:23:14 +03:00
# ifdef CONFIG_SECURITY_APPARMOR
static const struct pid_entry apparmor_attr_dir_stuff [ ] = {
ATTR ( " apparmor " , " current " , 0666 ) ,
ATTR ( " apparmor " , " prev " , 0444 ) ,
ATTR ( " apparmor " , " exec " , 0666 ) ,
} ;
LSM_DIR_OPS ( apparmor ) ;
# endif
2007-05-08 11:26:15 +04:00
static const struct pid_entry attr_dir_stuff [ ] = {
2018-09-22 03:16:59 +03:00
ATTR ( NULL , " current " , 0666 ) ,
ATTR ( NULL , " prev " , 0444 ) ,
ATTR ( NULL , " exec " , 0666 ) ,
ATTR ( NULL , " fscreate " , 0666 ) ,
ATTR ( NULL , " keycreate " , 0666 ) ,
ATTR ( NULL , " sockcreate " , 0666 ) ,
# ifdef CONFIG_SECURITY_SMACK
DIR ( " smack " , 0555 ,
proc_smack_attr_dir_inode_ops , proc_smack_attr_dir_ops ) ,
# endif
2019-02-04 16:23:14 +03:00
# ifdef CONFIG_SECURITY_APPARMOR
DIR ( " apparmor " , 0555 ,
proc_apparmor_attr_dir_inode_ops , proc_apparmor_attr_dir_ops ) ,
# endif
2006-10-02 13:17:05 +04:00
} ;
2013-05-16 20:07:31 +04:00
static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
2006-10-02 13:17:05 +04:00
{
2013-05-16 20:07:31 +04:00
	return proc_pident_readdir(file, ctx,
				   attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
2006-10-02 13:17:05 +04:00
}
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_attr_dir_operations = {
2005-04-17 02:20:36 +04:00
	.read		= generic_read_dir,
2016-04-21 00:13:54 +03:00
	.iterate_shared	= proc_attr_dir_readdir,
	.llseek		= generic_file_llseek,
2005-04-17 02:20:36 +04:00
};
2006-10-02 13:18:50 +04:00
static struct dentry *proc_attr_dir_lookup(struct inode *dir,
2012-06-11 01:13:09 +04:00
				struct dentry *dentry, unsigned int flags)
2006-10-02 13:17:05 +04:00
{
2006-10-02 13:18:56 +04:00
	return proc_pident_lookup(dir, dentry,
2019-03-12 09:28:51 +03:00
				  attr_dir_stuff,
				  attr_dir_stuff + ARRAY_SIZE(attr_dir_stuff));
2006-10-02 13:17:05 +04:00
}
2007-02-12 11:55:40 +03:00
static const struct inode_operations proc_attr_dir_inode_operations = {
2006-10-02 13:18:50 +04:00
	.lookup		= proc_attr_dir_lookup,
2006-06-26 11:25:55 +04:00
	.getattr	= pid_getattr,
2006-07-15 23:26:45 +04:00
	.setattr	= proc_setattr,
2005-04-17 02:20:36 +04:00
};
2006-10-02 13:17:05 +04:00
# endif
2009-12-16 03:47:37 +03:00
# ifdef CONFIG_ELF_CORE
2007-07-19 12:48:28 +04:00
static ssize_t proc_coredump_filter_read(struct file *file, char __user *buf,
					 size_t count, loff_t *ppos)
{
2013-01-24 02:07:38 +04:00
	struct task_struct *task = get_proc_task(file_inode(file));
2007-07-19 12:48:28 +04:00
	struct mm_struct *mm;
	char buffer[PROC_NUMBUF];
	size_t len;
	int ret;
	if (!task)
		return -ESRCH;
	ret = 0;
	mm = get_task_mm(task);
	if (mm) {
		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
			       ((mm->flags & MMF_DUMP_FILTER_MASK) >>
				MMF_DUMP_FILTER_SHIFT));
		mmput(mm);
		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
	}
	put_task_struct(task);
	return ret;
}
static ssize_t proc_coredump_filter_write(struct file *file,
					  const char __user *buf,
					  size_t count,
					  loff_t *ppos)
{
	struct task_struct *task;
	struct mm_struct *mm;
	unsigned int val;
	int ret;
	int i;
	unsigned long mask;
2015-09-10 01:36:59 +03:00
	ret = kstrtouint_from_user(buf, count, 0, &val);
	if (ret < 0)
		return ret;
2007-07-19 12:48:28 +04:00
	ret = -ESRCH;
2013-01-24 02:07:38 +04:00
	task = get_proc_task(file_inode(file));
2007-07-19 12:48:28 +04:00
	if (!task)
		goto out_no_task;
	mm = get_task_mm(task);
	if (!mm)
		goto out_no_mm;
2015-12-19 01:22:01 +03:00
	ret = 0;
2007-07-19 12:48:28 +04:00
	for (i = 0, mask = 1; i < MMF_DUMP_FILTER_BITS; i++, mask <<= 1) {
		if (val & mask)
			set_bit(i + MMF_DUMP_FILTER_SHIFT, &mm->flags);
		else
			clear_bit(i + MMF_DUMP_FILTER_SHIFT, &mm->flags);
	}
	mmput(mm);
out_no_mm:
	put_task_struct(task);
out_no_task:
2015-09-10 01:36:59 +03:00
	if (ret < 0)
		return ret;
	return count;
2007-07-19 12:48:28 +04:00
}
static const struct file_operations proc_coredump_filter_operations = {
	.read		= proc_coredump_filter_read,
	.write		= proc_coredump_filter_write,
2010-03-18 01:06:02 +03:00
	.llseek		= generic_file_llseek,
2007-07-19 12:48:28 +04:00
};
# endif
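The coredump_filter handlers above pack a small bit field into mm->flags: the write side copies the low MMF_DUMP_FILTER_BITS bits of the user value into flags starting at MMF_DUMP_FILTER_SHIFT, and the read side masks and shifts them back out. The following is a minimal userspace sketch of that bit mapping only. The names apply_dump_filter and read_dump_filter are hypothetical, the constant values are assumed for illustration, and plain (non-atomic) bit operations stand in for set_bit()/clear_bit().

```c
#include <assert.h>

/* Illustrative values; the kernel's MMF_DUMP_FILTER_* constants may differ. */
#define DUMP_FILTER_SHIFT 2
#define DUMP_FILTER_BITS  9
#define DUMP_FILTER_MASK  (((1UL << DUMP_FILTER_BITS) - 1) << DUMP_FILTER_SHIFT)

/* Mirror of the write loop: set or clear each filter bit of flags from val. */
static unsigned long apply_dump_filter(unsigned long flags, unsigned int val)
{
	unsigned long mask = 1;
	int i;

	for (i = 0; i < DUMP_FILTER_BITS; i++, mask <<= 1) {
		if (val & mask)
			flags |= 1UL << (i + DUMP_FILTER_SHIFT);
		else
			flags &= ~(1UL << (i + DUMP_FILTER_SHIFT));
	}
	return flags;
}

/* Mirror of the read side: extract the filter value back out of flags. */
static unsigned long read_dump_filter(unsigned long flags)
{
	return (flags & DUMP_FILTER_MASK) >> DUMP_FILTER_SHIFT;
}
```

Bits of the written value above the filter width are ignored, and flag bits below the shift are left untouched, which matches the loop bounds in the handler.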
2006-12-10 13:19:48 +03:00
# ifdef CONFIG_TASK_IO_ACCOUNTING
2014-08-09 01:21:50 +04:00
static int do_io_accounting(struct task_struct *task, struct seq_file *m, int whole)
2008-07-25 12:48:49 +04:00
{
2008-07-28 02:48:12 +04:00
	struct task_io_accounting acct = task->ioac;
2008-07-27 19:29:15 +04:00
	unsigned long flags;
2011-07-27 03:08:38 +04:00
	int result;
2008-07-27 19:29:15 +04:00
2020-12-03 23:12:00 +03:00
	result = down_read_killable(&task->signal->exec_update_lock);
2011-07-27 03:08:38 +04:00
	if (result)
		return result;
ptrace: use fsuid, fsgid, effective creds for fs access checks
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 02:00:04 +03:00
	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
2011-07-27 03:08:38 +04:00
		result = -EACCES;
		goto out_unlock;
	}
2011-06-24 16:08:38 +04:00
2008-07-27 19:29:15 +04:00
	if (whole && lock_task_sighand(task, &flags)) {
		struct task_struct *t = task;
		task_io_accounting_add(&acct, &task->signal->ioac);
		while_each_thread(task, t)
			task_io_accounting_add(&acct, &t->ioac);
		unlock_task_sighand(task, &flags);
2008-07-25 12:48:49 +04:00
	}
2015-04-16 02:18:17 +03:00
	seq_printf(m,
		   "rchar: %llu\n"
		   "wchar: %llu\n"
		   "syscr: %llu\n"
		   "syscw: %llu\n"
		   "read_bytes: %llu\n"
		   "write_bytes: %llu\n"
		   "cancelled_write_bytes: %llu\n",
		   (unsigned long long)acct.rchar,
		   (unsigned long long)acct.wchar,
		   (unsigned long long)acct.syscr,
		   (unsigned long long)acct.syscw,
		   (unsigned long long)acct.read_bytes,
		   (unsigned long long)acct.write_bytes,
		   (unsigned long long)acct.cancelled_write_bytes);
	result = 0;
2011-07-27 03:08:38 +04:00
out_unlock:
2020-12-03 23:12:00 +03:00
	up_read(&task->signal->exec_update_lock);
2011-07-27 03:08:38 +04:00
	return result;
2008-07-25 12:48:49 +04:00
}
2014-08-09 01:21:50 +04:00
static int proc_tid_io_accounting(struct seq_file *m, struct pid_namespace *ns,
				  struct pid *pid, struct task_struct *task)
2008-07-25 12:48:49 +04:00
{
2014-08-09 01:21:50 +04:00
	return do_io_accounting(task, m, 0);
2006-12-10 13:19:48 +03:00
}
2008-07-25 12:48:49 +04:00
2014-08-09 01:21:50 +04:00
static int proc_tgid_io_accounting(struct seq_file *m, struct pid_namespace *ns,
				   struct pid *pid, struct task_struct *task)
2008-07-25 12:48:49 +04:00
{
2014-08-09 01:21:50 +04:00
	return do_io_accounting(task, m, 1);
2008-07-25 12:48:49 +04:00
}
# endif /* CONFIG_TASK_IO_ACCOUNTING */
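The seq_printf() call in do_io_accounting() emits one "key: value" line per counter. As a rough illustration of how a userspace consumer might read that text, the sketch below parses such lines from a string buffer. The struct io_counters type and the parse_io_text function are hypothetical names introduced here, and the parser assumes well-formed input in the format shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirror of the counters printed by do_io_accounting(). */
struct io_counters {
	unsigned long long rchar, wchar, syscr, syscw;
	unsigned long long read_bytes, write_bytes, cancelled_write_bytes;
};

/* Parse "key: value" lines; returns the number of recognised fields. */
static int parse_io_text(const char *text, struct io_counters *out)
{
	static const char *const keys[] = {
		"rchar", "wchar", "syscr", "syscw",
		"read_bytes", "write_bytes", "cancelled_write_bytes",
	};
	unsigned long long *slots[] = {
		&out->rchar, &out->wchar, &out->syscr, &out->syscw,
		&out->read_bytes, &out->write_bytes, &out->cancelled_write_bytes,
	};
	int found = 0;

	memset(out, 0, sizeof(*out));
	while (*text) {
		char key[32];
		unsigned long long val;
		const char *eol = strchr(text, '\n');

		/* Each line is "<key>: <unsigned decimal>". */
		if (sscanf(text, "%31[^:]: %llu", key, &val) == 2) {
			for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
				if (strcmp(key, keys[i]) == 0) {
					*slots[i] = val;
					found++;
					break;
				}
			}
		}
		if (!eol)
			break;
		text = eol + 1;
	}
	return found;
}
```

Working on a buffer rather than opening the proc file directly keeps the sketch independent of CONFIG_TASK_IO_ACCOUNTING and of the ptrace access check, either of which can make the real file unavailable.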
2006-12-10 13:19:48 +03:00
2011-11-17 12:11:58 +04:00
# ifdef CONFIG_USER_NS
static int proc_id_map_open(struct inode *inode, struct file *file,
2014-08-09 01:21:22 +04:00
	const struct seq_operations *seq_ops)
2011-11-17 12:11:58 +04:00
{
	struct user_namespace *ns = NULL;
	struct task_struct *task;
	struct seq_file *seq;
	int ret = -EINVAL;
	task = get_proc_task(inode);
	if (task) {
		rcu_read_lock();
		ns = get_user_ns(task_cred_xxx(task, user_ns));
		rcu_read_unlock();
		put_task_struct(task);
	}
	if (!ns)
		goto err;
	ret = seq_open(file, seq_ops);
	if (ret)
		goto err_put_ns;
	seq = file->private_data;
	seq->private = ns;
	return 0;
err_put_ns:
	put_user_ns(ns);
err:
	return ret;
}
static int proc_id_map_release(struct inode *inode, struct file *file)
{
	struct seq_file *seq = file->private_data;
	struct user_namespace *ns = seq->private;
	put_user_ns(ns);
	return seq_release(inode, file);
}
static int proc_uid_map_open(struct inode *inode, struct file *file)
{
	return proc_id_map_open(inode, file, &proc_uid_seq_operations);
}
static int proc_gid_map_open(struct inode *inode, struct file *file)
{
	return proc_id_map_open(inode, file, &proc_gid_seq_operations);
}
2012-08-30 12:24:05 +04:00
static int proc_projid_map_open(struct inode *inode, struct file *file)
{
	return proc_id_map_open(inode, file, &proc_projid_seq_operations);
}
2011-11-17 12:11:58 +04:00
static const struct file_operations proc_uid_map_operations = {
	.open		= proc_uid_map_open,
	.write		= proc_uid_map_write,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= proc_id_map_release,
};
static const struct file_operations proc_gid_map_operations = {
	.open		= proc_gid_map_open,
	.write		= proc_gid_map_write,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= proc_id_map_release,
};
2012-08-30 12:24:05 +04:00
static const struct file_operations proc_projid_map_operations = {
	.open		= proc_projid_map_open,
	.write		= proc_projid_map_write,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= proc_id_map_release,
};
2014-12-02 21:27:26 +03:00
static int proc_setgroups_open(struct inode *inode, struct file *file)
{
	struct user_namespace *ns = NULL;
	struct task_struct *task;
	int ret;
	ret = -ESRCH;
	task = get_proc_task(inode);
	if (task) {
		rcu_read_lock();
		ns = get_user_ns(task_cred_xxx(task, user_ns));
		rcu_read_unlock();
		put_task_struct(task);
	}
	if (!ns)
		goto err;
	if (file->f_mode & FMODE_WRITE) {
		ret = -EACCES;
		if (!ns_capable(ns, CAP_SYS_ADMIN))
			goto err_put_ns;
	}
	ret = single_open(file, &proc_setgroups_show, ns);
	if (ret)
		goto err_put_ns;
	return 0;
err_put_ns:
	put_user_ns(ns);
err:
	return ret;
}
static int proc_setgroups_release(struct inode *inode, struct file *file)
{
	struct seq_file *seq = file->private_data;
	struct user_namespace *ns = seq->private;
	int ret = single_release(inode, file);
	put_user_ns(ns);
	return ret;
}
static const struct file_operations proc_setgroups_operations = {
	.open		= proc_setgroups_open,
	.write		= proc_setgroups_write,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= proc_setgroups_release,
};
2011-11-17 12:11:58 +04:00
# endif /* CONFIG_USER_NS */
2008-10-06 03:11:58 +04:00
static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
				struct pid *pid, struct task_struct *task)
{
2011-03-23 22:52:50 +03:00
	int err = lock_trace(task);
	if (!err) {
		seq_printf(m, "%08x\n", task->personality);
		unlock_trace(task);
	}
	return err;
2008-10-06 03:11:58 +04:00
}
2017-02-14 04:42:41 +03:00
# ifdef CONFIG_LIVEPATCH
static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
				struct pid *pid, struct task_struct *task)
{
	seq_printf(m, "%d\n", task->patch_state);
	return 0;
}
# endif /* CONFIG_LIVEPATCH */
2022-04-29 09:16:16 +03:00
# ifdef CONFIG_KSM
static int proc_pid_ksm_merging_pages(struct seq_file *m, struct pid_namespace *ns,
				      struct pid *pid, struct task_struct *task)
{
	struct mm_struct *mm;
	mm = get_task_mm(task);
	if (mm) {
		seq_printf(m, "%lu\n", mm->ksm_merging_pages);
		mmput(mm);
	}
	return 0;
}
ksm: count allocated ksm rmap_items for each process
Patch series "ksm: count allocated rmap_items and update documentation",
v5.
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.
To let users determine how beneficial the ksm-policy (like madvise) they
are using is, we add a new interface /proc/<pid>/ksm_stat for each process.
The value "ksm_rmap_items" in it indicates the total allocated ksm
rmap_items of this process.
The detailed description can be seen in the following patches' commit
message.
This patch (of 2):
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information. Some of these pages may be merged,
but some may not be able to be merged after being checked several times,
which is unprofitable memory consumption.
Whether KSM saves or consumes memory system-wide can be determined by a
comprehensive calculation over pages_sharing, pages_shared, pages_unshared
and pages_volatile. A simple approximate calculation:
profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items equals the sum of pages_sharing, pages_shared,
pages_unshared and pages_volatile.
But we cannot calculate this kind of ksm profit for a single process,
because the number of ksm rmap_items of a process is not available.
For user applications, if this kind of information could be obtained, it
would help users know how beneficial the ksm-policy (like madvise) they
are using is, and then optimize their app code. For example, if an
application madvises 1000 pages as MERGEABLE while only a few pages are
really merged, then it's not cost-efficient.
So we add a new interface /proc/<pid>/ksm_stat for each process, in which
only the value of ksm_rmap_items is shown for now; more values can be
added in future.
So similarly, we can calculate the ksm profit approximately for a single
process by:
profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
sizeof(rmap_item);
where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
ksm_rmap_items is shown in /proc/<pid>/ksm_stat.
Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
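The per-process profit formula in the commit message above can be worked through in a few lines of userspace C. This is only a sketch under stated assumptions: ksm_profit_estimate is a hypothetical helper, and the size of an rmap_item is passed in by the caller because userspace cannot query sizeof(rmap_item) from the kernel. The two counts are the values a reader would take from /proc/<pid>/ksm_merging_pages and /proc/<pid>/ksm_stat.

```c
#include <assert.h>

/*
 * Approximate per-process KSM profit in bytes, following
 *   profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items * sizeof(rmap_item)
 * rmap_item_size is an assumption supplied by the caller.
 */
static long long ksm_profit_estimate(unsigned long merging_pages,
				     unsigned long rmap_items,
				     unsigned long page_size,
				     unsigned long rmap_item_size)
{
	assert(page_size > 0 && rmap_item_size > 0);
	return (long long)merging_pages * (long long)page_size -
	       (long long)rmap_items * (long long)rmap_item_size;
}
```

With an assumed 4096-byte page and 64-byte rmap_item, a process that has 1000 rmap_items but only 10 merged pages yields a negative estimate, which is the "not cost-efficient" case the commit message describes.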
2022-08-30 17:38:38 +03:00
static int proc_pid_ksm_stat(struct seq_file *m, struct pid_namespace *ns,
			     struct pid *pid, struct task_struct *task)
{
	struct mm_struct *mm;
	mm = get_task_mm(task);
	if (mm) {
		seq_printf(m, "ksm_rmap_items %lu\n", mm->ksm_rmap_items);
2023-04-18 08:13:41 +03:00
		seq_printf(m, "ksm_merging_pages %lu\n", mm->ksm_merging_pages);
		seq_printf(m, "ksm_process_profit %ld\n", ksm_process_profit(mm));
2022-08-30 17:38:38 +03:00
		mmput(mm);
	}
	return 0;
}
2022-04-29 09:16:16 +03:00
# endif /* CONFIG_KSM */
2018-08-17 01:17:01 +03:00
# ifdef CONFIG_STACKLEAK_METRICS
static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
			    struct pid *pid, struct task_struct *task)
{
	unsigned long prev_depth = THREAD_SIZE -
				(task->prev_lowest_stack & (THREAD_SIZE - 1));
	unsigned long depth = THREAD_SIZE -
				(task->lowest_stack & (THREAD_SIZE - 1));
	seq_printf(m, "previous stack depth: %lu\nstack depth: %lu\n",
		   prev_depth, depth);
	return 0;
}
# endif /* CONFIG_STACKLEAK_METRICS */
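The depth arithmetic in proc_stack_depth() can be checked by hand: masking the lowest recorded stack address with THREAD_SIZE - 1 gives its offset within the stack region, and subtracting that offset from THREAD_SIZE gives the number of bytes used. The sketch below reproduces that calculation in userspace. It assumes a downward-growing stack in a region aligned to its own size; EXAMPLE_THREAD_SIZE and stack_depth are hypothetical names, and 16384 is only an example value since the kernel's THREAD_SIZE is architecture-specific.

```c
#include <assert.h>

/* Illustrative stack size; the kernel's THREAD_SIZE is architecture-specific. */
#define EXAMPLE_THREAD_SIZE 16384UL

/* Bytes used between the lowest address reached and the top of the stack. */
static unsigned long stack_depth(unsigned long lowest_stack)
{
	return EXAMPLE_THREAD_SIZE - (lowest_stack & (EXAMPLE_THREAD_SIZE - 1));
}
```

For example, a lowest address at offset 0x3f00 inside a 0x4000-byte region corresponds to a depth of 0x100 bytes.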
2006-10-02 13:17:05 +04:00
/*
* Thread groups
*/
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_task_operations;
2007-02-12 11:55:40 +03:00
static const struct inode_operations proc_task_inode_operations;
2006-10-02 13:17:07 +04:00
2007-05-08 11:26:15 +04:00
static const struct pid_entry tgid_base_stuff[] = {
2008-11-10 01:32:52 +03:00
	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
2012-01-11 03:11:23 +04:00
	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
2021-07-01 04:54:44 +03:00
	DIR("fdinfo",     S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
2010-03-08 03:41:34 +03:00
	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
2008-03-12 04:03:35 +03:00
#ifdef CONFIG_NET
2008-11-10 01:32:52 +03:00
	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
2008-03-12 04:03:35 +03:00
#endif
2008-11-10 01:32:52 +03:00
	REG("environ",    S_IRUSR, proc_environ_operations),
2016-10-06 01:43:43 +03:00
	REG("auxv",       S_IRUSR, proc_auxv_operations),
2008-11-10 01:32:52 +03:00
	ONE("status",     S_IRUGO, proc_pid_status),
2014-04-08 02:38:36 +04:00
	ONE("personality", S_IRUSR, proc_pid_personality),
2014-08-09 01:21:37 +04:00
	ONE("limits",	  S_IRUGO, proc_pid_limits),
2007-07-09 20:52:00 +04:00
#ifdef CONFIG_SCHED_DEBUG
2008-11-10 01:32:52 +03:00
	REG("sched",      S_IRUGO|S_IWUSR, proc_pid_sched_operations),
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting its nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 16:18:03 +03:00
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
	REG("autogroup",  S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
2019-11-12 04:27:16 +03:00
#endif
#ifdef CONFIG_TIME_NS
	REG("timens_offsets",  S_IRUGO|S_IWUSR, proc_timens_offsets_operations),
2008-07-26 06:46:00 +04:00
#endif
2009-12-15 05:00:05 +03:00
	REG("comm",      S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
2008-07-26 06:46:00 +04:00
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
2014-08-09 01:21:39 +04:00
	ONE("syscall",    S_IRUSR, proc_pid_syscall),
2007-07-09 20:52:00 +04:00
#endif
2015-06-26 01:00:54 +03:00
	REG("cmdline",    S_IRUGO, proc_pid_cmdline_ops),
2008-11-10 01:32:52 +03:00
	ONE("stat",       S_IRUGO, proc_tgid_stat),
	ONE("statm",      S_IRUGO, proc_pid_statm),
procfs: mark thread stack correctly in proc/<pid>/maps
Stack for a new thread is mapped by userspace code and passed via
sys_clone. This memory is currently seen as anonymous in
/proc/<pid>/maps, which makes it difficult to ascertain which mappings
are being used for thread stacks. This patch uses the individual task
stack pointers to determine which vmas are actually thread stacks.
For a multithreaded program like the following:
#include <pthread.h>
void *thread_main(void *foo)
{
while(1);
}
int main()
{
pthread_t t;
pthread_create(&t, NULL, thread_main, NULL);
pthread_join(t, NULL);
}
proc/PID/maps looks like the following:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
the earlier vma has no permissions (7f8a44e3d000-7f8a4503d000), but
that is not always a reliable way to find out which vma is a thread
stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps have the same
content.
With this patch in place, /proc/PID/task/TID/maps are treated as 'maps
as the task would see it' and hence, only the vma that that task uses as
stack is marked as [stack]. All other 'stack' vmas are marked as
anonymous memory. /proc/PID/maps acts as a thread group level view,
where all thread stack vmas are marked as [stack:TID] where TID is the
process ID of the task that uses that vma as stack, while the process
stack is marked as [stack].
So /proc/PID/maps will look like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
This marks all vmas that are used as stacks by the threads in the
thread group, along with the process stack. The task level maps will,
however, look like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
where only the vma that is being used as a stack by *that* task is
marked as [stack].
Analogous changes have been made to /proc/PID/smaps,
/proc/PID/numa_maps, /proc/PID/task/TID/smaps and
/proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
numa_maps:
[siddhesh@localhost ~ ]$ pgrep a.out
1441
[siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
7fff6273a000 default stack anon=3 dirty=3 N0=3
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
7f8a44492000 default stack anon=2 dirty=2 N0=2
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
7fff6273a000 default stack anon=3 dirty=3 N0=3
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:04 +04:00
REG("maps", S_IRUGO, proc_pid_maps_operations),
2006-10-02 13:17:05 +04:00
#ifdef CONFIG_NUMA
procfs: mark thread stack correctly in proc/<pid>/maps
Stack for a new thread is mapped by userspace code and passed via
sys_clone. This memory is currently seen as anonymous in
/proc/<pid>/maps, which makes it difficult to ascertain which mappings
are being used for thread stacks. This patch uses the individual task
stack pointers to determine which vmas are actually thread stacks.
For a multithreaded program like the following:
#include <pthread.h>
void *thread_main(void *foo)
{
while(1);
}
int main()
{
pthread_t t;
pthread_create(&t, NULL, thread_main, NULL);
pthread_join(t, NULL);
}
/proc/PID/maps looks like the following:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
the vma just before it (7f8a44491000-7f8a44492000) has no permissions,
but that is not always a reliable way to find out which vma is a thread
stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps have the same
content.
With this patch in place, /proc/PID/task/TID/maps is treated as 'maps
as the task would see it' and hence, only the vma that the task uses as
its stack is marked as [stack]. All other 'stack' vmas are marked as
anonymous memory. /proc/PID/maps acts as a thread group level view,
where all thread stack vmas are marked as [stack:TID], where TID is the
ID of the task that uses that vma as its stack, while the process
stack is marked as [stack].
So /proc/PID/maps will look like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
This marks all vmas that are used as stacks by the threads in the
thread group along with the process stack. The task level maps will
however look like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
where only the vma that is being used as a stack by *that* task is
marked as [stack].
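A userspace reader could tell the two annotations apart roughly as
follows. This is only a minimal sketch: stack_annotation() is a
hypothetical helper name, not part of the patch or of any library, and
it assumes the "[stack" marker does not appear inside a mapped file's
path.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Classify one line of /proc/PID/maps or /proc/PID/task/TID/maps.
 * Returns 0 if the line carries no stack marker,
 *         1 if it ends in "[stack]"      (*tid_out is set to -1),
 *         2 if it ends in "[stack:TID]"  (*tid_out receives TID).
 * Hypothetical helper, written for illustration only.
 */
static int stack_annotation(const char *line, long *tid_out)
{
	const char *p;
	long tid;

	assert(line != NULL && tid_out != NULL);
	*tid_out = -1;

	p = strstr(line, "[stack");
	if (p == NULL)
		return 0;

	p += strlen("[stack");
	if (*p == ']')
		return 1;

	/* Thread group level view: the marker carries the task ID. */
	if (sscanf(p, ":%ld]", &tid) == 1) {
		*tid_out = tid;
		return 2;
	}
	return 0;
}
```

Fed the lines shown above, the thread group level file yields both
kinds of marker, while a task level file only ever yields the plain
"[stack]" form.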
Analogous changes have been made to /proc/PID/smaps,
/proc/PID/numa_maps, /proc/PID/task/TID/smaps and
/proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
numa_maps:
[siddhesh@localhost ~ ]$ pgrep a.out
1441
[siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
7fff6273a000 default stack anon=3 dirty=3 N0=3
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
7f8a44492000 default stack anon=2 dirty=2 N0=2
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
7fff6273a000 default stack anon=3 dirty=3 N0=3
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:04 +04:00
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
2006-10-02 13:17:05 +04:00
#endif
2008-11-10 01:32:52 +03:00
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd", proc_cwd_link),
LNK("root", proc_root_link),
LNK("exe", proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
REG("mountstats", S_IRUSR, proc_mountstats_operations),
2008-02-05 09:29:07 +03:00
#ifdef CONFIG_PROC_PAGE_MONITOR
2008-11-10 01:32:52 +03:00
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
2012-03-22 03:34:04 +04:00
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
mm: add /proc/pid/smaps_rollup
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.
Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes. This sampling process involves
opening /proc/pid/smaps and summing certain fields. For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.
smaps_rollup improves the situation. It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup contains one synthetic smaps-format entry representing
the whole process. In the single smaps_rollup synthetic entry, each
field is the summation of the corresponding field in all of the
real-smaps VMAs. Using a common format for smaps_rollup and smaps
allows userspace to reuse parsers written for non-rollup smaps, and it
allows userspace to switch between smaps_rollup and smaps at runtime
(say, based on the availability of smaps_rollup in a given kernel) with
minimal fuss.
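As an illustration of the shared format, the sketch below sums the
"Pss:" lines of whichever file it is handed, so the same loop serves
both smaps and smaps_rollup. The helper names sum_pss_kb() and
pss_kb_of() are hypothetical, the 512-byte line buffer is an arbitrary
choice, and the fallback order (rollup first, then smaps) is just one
reasonable policy.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Sum every "Pss:" field (in kB) in smaps-format text read from fp.
 * Works unchanged on smaps (one Pss line per VMA) and on smaps_rollup
 * (a single Pss line). Other fields such as "Rss:" are ignored.
 */
static unsigned long long sum_pss_kb(FILE *fp)
{
	char line[512];
	unsigned long long total = 0, kb;

	assert(fp != NULL);
	while (fgets(line, sizeof(line), fp) != NULL) {
		if (sscanf(line, "Pss: %llu kB", &kb) == 1)
			total += kb;
	}
	return total;
}

/*
 * Return the PSS of a process in kB, or -1 if neither file can be
 * opened. pid is a decimal PID string or "self".
 */
static long long pss_kb_of(const char *pid)
{
	char path[64];
	unsigned long long total;
	FILE *fp;

	assert(pid != NULL);
	snprintf(path, sizeof(path), "/proc/%s/smaps_rollup", pid);
	fp = fopen(path, "r");
	if (fp == NULL) {
		/* Older kernel: fall back to the per-VMA file. */
		snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
		fp = fopen(path, "r");
	}
	if (fp == NULL)
		return -1;

	total = sum_pss_kb(fp);
	fclose(fp);
	return (long long)total;
}
```

A caller would use pss_kb_of("self") or pass a PID string. Only the
amount of text the kernel has to format and the caller has to read
differs between the two files.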
By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings. For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.
One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status. We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.
The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA. In this way, summation happens automatically. We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.
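The difference between the two modes can be sketched in plain C. This
is a simplified model, not the code in task_mmu.c: struct stats stands
in for mem_size_stats with only two fields, and walk_vmas() stands in
for the seq_file walk. Both names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for mem_size_stats (two fields only). */
struct stats {
	unsigned long rss_kb;
	unsigned long pss_kb;
};

/* Add one VMA's counters into the accumulator. */
static void add_stats(struct stats *acc, const struct stats *vma)
{
	acc->rss_kb += vma->rss_kb;
	acc->pss_kb += vma->pss_kb;
}

/*
 * Walk n VMAs and store the entries that would be emitted in out[],
 * which must have room for n entries. Returns the number of entries.
 * Regular mode starts from fresh stats for every VMA and emits each
 * one; rollup mode keeps the stats across VMAs and emits only once,
 * at the last VMA.
 */
static size_t walk_vmas(const struct stats *vmas, size_t n, int rollup,
			struct stats *out)
{
	struct stats acc = { 0, 0 };
	size_t emitted = 0;
	size_t i;

	assert(vmas != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		if (!rollup) {
			acc.rss_kb = 0;
			acc.pss_kb = 0;
		}
		add_stats(&acc, &vmas[i]);
		if (!rollup || i == n - 1)
			out[emitted++] = acc;
	}
	return emitted;
}
```

The walk is identical in both modes. Only the reset and the emit
condition differ, which is why the summation comes for free.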
Benchmarks:
using smaps:
iterations:1000 pid:1163 pss:220023808
0m29.46s real 0m08.28s user 0m20.98s system
using smaps_rollup:
iterations:1000 pid:1163 pss:220702720
0m04.39s real 0m00.03s user 0m04.31s system
We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_score_adj, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows. Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and
devices have become more efficiency-conscious, PSS-collection inefficiency
has started to matter more. IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.
Tim said:
: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
: application via the settings app. This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications. We have found
: an enormous number of bugs (in the Android platform, in Google's own
: apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that. The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend). PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it. We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent. Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process. The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release. After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process. Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward. I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).
Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07 02:25:08 +03:00
REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
2014-04-08 02:38:38 +04:00
REG("pagemap", S_IRUSR, proc_pagemap_operations),
2006-10-02 13:17:05 +04:00
#endif
#ifdef CONFIG_SECURITY
2008-11-10 01:32:52 +03:00
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
2006-10-02 13:17:05 +04:00
#endif
#ifdef CONFIG_KALLSYMS
2014-08-09 01:21:44 +04:00
ONE("wchan", S_IRUGO, proc_pid_wchan),
2006-10-02 13:17:05 +04:00
#endif
2008-11-10 11:26:08 +03:00
#ifdef CONFIG_STACKTRACE
2014-04-08 02:38:36 +04:00
ONE("stack", S_IRUSR, proc_pid_stack),
2006-10-02 13:17:05 +04:00
#endif
2015-06-30 12:06:03 +03:00
#ifdef CONFIG_SCHED_INFO
2014-08-09 01:21:46 +04:00
ONE("schedstat", S_IRUGO, proc_pid_schedstat),
2006-10-02 13:17:05 +04:00
#endif
2008-01-25 23:08:34 +03:00
#ifdef CONFIG_LATENCYTOP
2008-11-10 01:32:52 +03:00
REG("latency", S_IRUGO, proc_lstats_operations),
2008-01-25 23:08:34 +03:00
#endif
2007-10-19 10:39:39 +04:00
#ifdef CONFIG_PROC_PID_CPUSET
2014-09-18 12:03:36 +04:00
ONE("cpuset", S_IRUGO, proc_cpuset_show),
2007-10-19 10:39:35 +04:00
#endif
#ifdef CONFIG_CGROUPS
2014-09-18 12:03:15 +04:00
ONE("cgroup", S_IRUGO, proc_cgroup_show),
2020-01-15 12:28:51 +03:00
#endif
#ifdef CONFIG_PROC_CPU_RESCTRL
ONE("cpu_resctrl_groups", S_IRUGO, proc_resctrl_show),
2006-10-02 13:17:05 +04:00
#endif
2014-08-09 01:21:48 +04:00
ONE("oom_score", S_IRUGO, proc_oom_score),
2012-11-13 05:53:04 +04:00
REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adj_operations),
oom: badness heuristic rewrite
This is a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space are used. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500, the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
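The arithmetic described above can be sketched as follows. This is an
illustration under stated assumptions, not the kernel's badness()
function: badness_sketch() and its parameters are hypothetical names,
the root bonus is modelled as a flat 30 points (3% of the 0..1000
scale), and the order of the adjustments and the final clamp are
guesses made for the sake of a runnable example.

```c
#include <assert.h>

/*
 * Proportion of allowable memory used by a task, on a 0..1000 scale,
 * adjusted by oom_score_adj (-1000..+1000).
 *
 * Assumptions (not taken from the kernel source): the root bonus is
 * a flat 30 points, and the result is clamped to 0..1000.
 */
static long badness_sketch(unsigned long rss_pages,
			   unsigned long swap_pages,
			   unsigned long allowable_pages,
			   int is_root, int oom_score_adj)
{
	long points;

	assert(allowable_pages > 0);
	assert(oom_score_adj >= -1000 && oom_score_adj <= 1000);

	/* Baseline: share of allowable memory held in RAM plus swap. */
	points = (long)((rss_pages + swap_pages) * 1000ULL / allowable_pages);

	if (is_root)
		points -= 30;	/* 3% extra memory for root tasks */

	/* The userspace tunable is added directly into the score. */
	points += oom_score_adj;

	if (points < 0)
		points = 0;
	if (points > 1000)
		points = 1000;
	return points;
}
```

With these assumptions, a task holding half of the allowable memory
scores 500, and the same task with oom_score_adj set to -500 scores 0,
matching the "discount 50%" reading given above.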
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
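A linear rescaling between the two tunables could look like the sketch
below. The constants are assumptions recalled from the kernel's oom
header rather than anything stated in this message (oom_adj running
from -17, meaning disabled, to +15), so they should be checked against
the tree before being relied on. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed ranges; verify against the kernel's oom header. */
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/* Rescale an oom_adj value into oom_score_adj units. */
static int oom_adj_to_score_adj(int oom_adj)
{
	assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);

	/* Map the old maximum onto the new maximum exactly. */
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}

/* Rescale an oom_score_adj value back into oom_adj units. */
static int score_adj_to_oom_adj(int score_adj)
{
	assert(score_adj >= OOM_SCORE_ADJ_MIN &&
	       score_adj <= OOM_SCORE_ADJ_MAX);

	if (score_adj == OOM_SCORE_ADJ_MAX)
		return OOM_ADJUST_MAX;
	return score_adj * -OOM_DISABLE / OOM_SCORE_ADJ_MAX;
}
```

Because oom_adj is the coarser scale, a round trip through it loses
precision: under these assumptions 8 maps to 470, and 470 maps back
to 7.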
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10 04:19:46 +04:00
REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
2019-01-23 01:06:39 +03:00
#ifdef CONFIG_AUDIT
2008-11-10 01:32:52 +03:00
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
2006-10-02 13:17:05 +04:00
#endif
2006-12-08 13:39:47 +03:00
#ifdef CONFIG_FAULT_INJECTION
2008-11-10 01:32:52 +03:00
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
2017-07-15 00:50:00 +03:00
REG("fail-nth", 0644, proc_fail_nth_operations),
2006-12-08 13:39:47 +03:00
#endif
2009-12-16 03:47:37 +03:00
#ifdef CONFIG_ELF_CORE
2008-11-10 01:32:52 +03:00
REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
2007-07-19 12:48:28 +04:00
#endif
2006-12-10 13:19:48 +03:00
#ifdef CONFIG_TASK_IO_ACCOUNTING
2014-08-09 01:21:50 +04:00
ONE("io", S_IRUSR, proc_tgid_io_accounting),
2006-12-10 13:19:48 +03:00
#endif
2011-11-17 12:11:58 +04:00
#ifdef CONFIG_USER_NS
REG("uid_map", S_IRUGO|S_IWUSR, proc_uid_map_operations),
REG("gid_map", S_IRUGO|S_IWUSR, proc_gid_map_operations),
2012-08-30 12:24:05 +04:00
REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
2014-12-02 21:27:26 +03:00
REG("setgroups", S_IRUGO|S_IWUSR, proc_setgroups_operations),
2011-11-17 12:11:58 +04:00
#endif
2017-01-21 08:09:08 +03:00
#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_POSIX_TIMERS)
2013-03-11 13:12:45 +04:00
REG("timers", S_IRUGO, proc_timers_operations),
#endif
2016-03-18 00:20:54 +03:00
REG("timerslack_ns", S_IRUGO|S_IWUGO, proc_pid_set_timerslack_ns_operations),
2017-02-14 04:42:41 +03:00
#ifdef CONFIG_LIVEPATCH
ONE("patch_state", S_IRUSR, proc_pid_patch_state),
#endif
2018-08-17 01:17:01 +03:00
#ifdef CONFIG_STACKLEAK_METRICS
ONE("stack_depth", S_IRUGO, proc_stack_depth),
#endif
2019-06-06 04:22:34 +03:00
#ifdef CONFIG_PROC_PID_ARCH_STATUS
ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif
2020-11-11 16:33:54 +03:00
#ifdef CONFIG_SECCOMP_CACHE_DEBUG
ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
#endif
2022-04-29 09:16:16 +03:00
#ifdef CONFIG_KSM
ONE("ksm_merging_pages", S_IRUSR, proc_pid_ksm_merging_pages),
ksm: count allocated ksm rmap_items for each process
Patch series "ksm: count allocated rmap_items and update documentation",
v5.
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.
To let users determine how beneficial the ksm-policy (like madvise) they are
using is, we add a new interface /proc/<pid>/ksm_stat for each process.
The value "ksm_rmap_items" in it indicates the total allocated ksm
rmap_items of this process.
The detailed description can be seen in the following patches' commit
message.
This patch (of 2):
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information. Some of these pages may be merged,
but some may not be able to be merged even after being checked several times,
and the rmap_items kept for them are unprofitable memory consumption.
The information about whether KSM saves memory or consumes memory in
system-wide range can be determined by the comprehensive calculation of
pages_sharing, pages_shared, pages_unshared and pages_volatile. A simple
approximate calculation:
profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items equals the sum of pages_sharing, pages_shared,
pages_unshared and pages_volatile.
But we cannot calculate this kind of ksm profit for a single process
because the number of ksm rmap_items of a process is not available.
For user applications, this kind of information helps users know how
beneficial the ksm-policy (like madvise) they are using is, so that they
can optimize their app code. For example, if an application madvises 1000
pages as MERGEABLE while only a few pages are really merged, then it's not
cost-efficient.
So we add a new interface /proc/<pid>/ksm_stat for each process. For now
it only shows the value of ksm_rmap_items; more values can be added in
future.
So similarly, we can calculate the ksm profit approximately for a single
process by:
profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
sizeof(rmap_item);
where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
ksm_rmap_items is shown in /proc/<pid>/ksm_stat.
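The per-process estimate above can be sketched in userspace C as follows.
This is a minimal illustration, not part of the patch: the helper names
ksm_profit_bytes() and parse_ksm_rmap_items() are hypothetical, the
"ksm_rmap_items <n>" line format of /proc/<pid>/ksm_stat is assumed, and
sizeof(rmap_item) is not exported to userspace, so the 64-byte figure is only
a placeholder that should be checked against the running kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: placeholder for sizeof(struct rmap_item); kernel-build specific. */
#define ASSUMED_RMAP_ITEM_SIZE 64L

/* profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items * sizeof(rmap_item) */
long ksm_profit_bytes(long merging_pages, long rmap_items,
		      long page_size, long rmap_item_size)
{
	return merging_pages * page_size - rmap_items * rmap_item_size;
}

/*
 * Extract the ksm_rmap_items value from the text of /proc/<pid>/ksm_stat.
 * Returns -1 if the key is missing or not followed by a number.
 */
long parse_ksm_rmap_items(const char *text)
{
	const char *p;
	long value;

	assert(text != NULL);
	p = strstr(text, "ksm_rmap_items");
	if (!p || sscanf(p, "ksm_rmap_items %ld", &value) != 1)
		return -1;
	return value;
}
```

A caller would read the count from /proc/<pid>/ksm_merging_pages, pass the
contents of /proc/<pid>/ksm_stat to parse_ksm_rmap_items(), and use
sysconf(_SC_PAGESIZE) for the page size. A negative result suggests that the
madvise policy costs more memory than it saves.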
Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-30 17:38:38 +03:00
	ONE("ksm_stat", S_IRUSR, proc_pid_ksm_stat),
2022-04-29 09:16:16 +03:00
#endif
2006-10-02 13:17:05 +04:00
};
2005-04-17 02:20:36 +04:00
2013-05-16 20:07:31 +04:00
static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
2005-04-17 02:20:36 +04:00
{
2013-05-16 20:07:31 +04:00
	return proc_pident_readdir(file, ctx,
				   tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
2005-04-17 02:20:36 +04:00
}
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_tgid_base_operations = {
2005-04-17 02:20:36 +04:00
	.read = generic_read_dir,
2016-04-21 00:13:54 +03:00
	.iterate_shared = proc_tgid_base_readdir,
	.llseek = generic_file_llseek,
2005-04-17 02:20:36 +04:00
};
signal: add pidfd_send_signal() syscall
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].
This patch uses file descriptors (fd) from /proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).
/* prototype and arguments */
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's y2038 rework
and to minimize merge conflicts (cf. [25]).
In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), namely the case
where a positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signals to threads and
process groups, appropriately named flags (e.g. PIDFD_TYPE_PGID and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).
/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that, given that the
flags argument decides the scope of the signal instead of different types
of fds, it might make sense to settle for either "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.
The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
former because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.
/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.
/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).
/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_any() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. [12]).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.
/* testing */
This patch was tested on x64 and x86.
/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
errno = ENOSYS; /* report the error via errno, as syscall() does */
return -1;
#endif
}
int main(int argc, char *argv[])
{
int fd, ret, saved_errno, sig;
if (argc < 3)
exit(EXIT_FAILURE);
fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
exit(EXIT_FAILURE);
}
sig = atoi(argv[2]);
printf("Sending signal %d to process %s\n", sig, argv[1]);
ret = do_pidfd_send_signal(fd, sig, NULL, 0);
saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to send signal %d to process %s\n",
strerror(errno), sig, argv[1]);
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
* For the sake of progress please consider these arguments settled unless
* there is a new point that desperately needs to be addressed. Please make
* sure to check the links to the threads in this commit message to see
* whether it has already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created precisely so that struct task_struct does not
need to be pinned (cf. [22]).
Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle, not that pids are not
recycled (cf. [22]).
Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [22])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscall just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscall to get such file descriptors is planned for a
future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety of reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].
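The behaviour described in A-01 and A-04 can be sketched as follows. This is
a hedged illustration rather than part of the patch: probe_reaped_child() is
a hypothetical helper, it assumes procfs is mounted at /proc, and on kernels
or C libraries without pidfd_send_signal() it reports ENOSYS instead of
ESRCH.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thin wrapper; fails with ENOSYS when the headers lack the syscall number. */
int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Fork a child that exits at once, open /proc/<child> while it is still a
 * zombie, reap it, then probe the handle with signal 0.
 * Returns 0 if the probe succeeded, the errno of the failed probe otherwise
 * (ESRCH is expected per A-01), or -1 if the setup itself failed.
 */
int probe_reaped_child(void)
{
	char path[64];
	int fd, n, ret, err;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(0);

	n = snprintf(path, sizeof(path), "/proc/%d", (int)pid);
	assert(n > 0 && (size_t)n < sizeof(path));

	fd = open(path, O_DIRECTORY | O_CLOEXEC);
	if (fd < 0) {
		waitpid(pid, NULL, 0);
		return -1;
	}
	/* After reaping, the numeric pid may be recycled; the fd must not follow it. */
	if (waitpid(pid, NULL, 0) < 0) {
		close(fd);
		return -1;
	}
	ret = do_pidfd_send_signal(fd, 0, NULL, 0);
	err = errno;
	close(fd);
	return ret == 0 ? 0 : err;
}
```

Signal 0 is used so that nothing is delivered even if the probe were to
succeed; the point is only that the handle never resolves to an unrelated
process that happens to reuse the pid.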
/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
2018-11-19 02:51:56 +03:00
struct pid *tgid_pidfd_to_pid(const struct file *file)
{
2019-06-27 12:35:14 +03:00
	if (file->f_op != &proc_tgid_base_operations)
2018-11-19 02:51:56 +03:00
		return ERR_PTR(-EBADF);
	return proc_pid(file_inode(file));
}
2012-06-11 01:13:09 +04:00
static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
{
2006-10-02 13:18:56 +04:00
	return proc_pident_lookup(dir, dentry,
2019-03-12 09:28:51 +03:00
				  tgid_base_stuff,
				  tgid_base_stuff + ARRAY_SIZE(tgid_base_stuff));
2005-04-17 02:20:36 +04:00
}
2007-02-12 11:55:40 +03:00
static const struct inode_operations proc_tgid_base_inode_operations = {
2006-10-02 13:17:05 +04:00
	.lookup = proc_tgid_base_lookup,
2006-06-26 11:25:55 +04:00
	.getattr = pid_getattr,
2006-07-15 23:26:45 +04:00
	.setattr = proc_setattr,
procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.
The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:
hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.
hidepid=1 means users may not access any /proc/<pid>/ directories, but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking is done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.
hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides process' euid
and egid. It complicates an intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.
gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting
nonroot user in sudoers file or something. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystroke
timings:
http://www.openwall.com/lists/oss-security/2011/11/05/3
hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on the fact that the leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user who started the setuid program may read
the counters.
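As an illustration of how the two options combine, the mount data string
could be assembled like this. It is a sketch under stated assumptions:
proc_mount_opts() is a hypothetical helper that is not part of the patch, and
it only accepts the hidepid levels 0 to 2 described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Format the procfs mount data, e.g. "hidepid=2,gid=1000".
 * Pass use_gid = 0 to omit the gid= option.
 * Returns 0 on success, -1 on an unknown hidepid level or a too-small buffer.
 */
int proc_mount_opts(char *buf, size_t len, int hidepid, int use_gid, gid_t gid)
{
	size_t used;
	int n;

	assert(buf != NULL);
	if (hidepid < 0 || hidepid > 2)
		return -1;

	n = snprintf(buf, len, "hidepid=%d", hidepid);
	if (n < 0 || (size_t)n >= len)
		return -1;
	if (!use_gid)
		return 0;

	/* Append the group that keeps hidepid=0 style visibility. */
	used = strlen(buf);
	n = snprintf(buf + used, len - used, ",gid=%u", (unsigned int)gid);
	if (n < 0 || (size_t)n >= len - used)
		return -1;
	return 0;
}
```

The resulting string is what would be passed as the data argument of
mount(2), for example mount("proc", "/proc", "proc", MS_REMOUNT, opts) run as
root, or written as the options field of an /etc/fstab entry.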
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:31 +04:00
	.permission	= proc_pid_permission,
2005-04-17 02:20:36 +04:00
};
2007-10-22 08:00:10 +04:00
/**
2020-02-20 03:22:26 +03:00
 * proc_flush_pid - Remove dcache entries for @pid from the /proc dcache.
 * @pid: pid that should be flushed.
2007-10-22 08:00:10 +04:00
 *
2020-02-20 03:22:26 +03:00
 * This function walks a list of inodes (that belong to any proc
 * filesystem) that are attached to the pid and flushes them from
 * the dentry cache.
2007-10-22 08:00:10 +04:00
 *
 * It is safe and reasonable to cache /proc entries for a task until
 * that task exits.  After that they just clog up the dcache with
 * useless entries, possibly causing useful dcache entries to be
2020-02-20 03:22:26 +03:00
 * flushed instead.  This routine is provided to flush those useless
 * dcache entries when a process is reaped.
2007-10-22 08:00:10 +04:00
 *
 * NOTE: This routine is just an optimization so it does not guarantee
2020-02-20 03:22:26 +03:00
 * that no dcache entries will exist after a process is reaped
 * it just makes it very unlikely that any will persist.
2007-10-19 10:40:03 +04:00
 */
2020-02-20 03:22:26 +03:00
void proc_flush_pid(struct pid *pid)
2007-10-19 10:40:03 +04:00
{
2020-04-07 17:43:04 +03:00
	proc_invalidate_siblings_dcache(&pid->inodes, &pid->lock);
2007-10-19 10:40:03 +04:00
}
2018-05-03 16:21:05 +03:00
static struct dentry *proc_pid_instantiate(struct dentry *dentry,
2013-06-15 11:15:20 +04:00
				   struct task_struct *task, const void *ptr)
2006-10-02 13:18:49 +04:00
{
	struct inode *inode;
2022-07-13 16:00:29 +03:00
	inode = proc_pid_make_base_inode(dentry->d_sb, task,
					 S_IFDIR | S_IRUGO | S_IXUGO);
2006-10-02 13:18:49 +04:00
	if (!inode)
2018-05-03 16:21:05 +03:00
		return ERR_PTR(-ENOENT);
2006-10-02 13:18:49 +04:00
	inode->i_op = &proc_tgid_base_inode_operations;
	inode->i_fop = &proc_tgid_base_operations;
	inode->i_flags |= S_IMMUTABLE;
2008-06-06 09:46:53 +04:00
2016-12-13 03:45:32 +03:00
	set_nlink(inode, nlink_tgid);
2018-05-03 04:26:16 +03:00
	pid_update_inode(task, inode);
2006-10-02 13:18:49 +04:00
2011-01-07 09:49:55 +03:00
	d_set_d_op(dentry, &pid_dentry_operations);
2018-05-03 16:21:05 +03:00
	return d_splice_alias(inode, dentry);
2006-10-02 13:18:49 +04:00
}
2019-03-06 02:50:29 +03:00
struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
2005-04-17 02:20:36 +04:00
{
	struct task_struct *task;
	unsigned tgid;
proc: allow to mount many instances of proc in one pid namespace
This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.
1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespaces we have to manage all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.
2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.
This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specify on user sessions if we should populate procfs with
pids that the user can ptrace or not.
By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptraceable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usable way to specify which users should see all processes
and which users can not.
Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...
In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.
Selftest has been added to verify new behavior.
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-19 17:10:52 +03:00
	struct proc_fs_info *fs_info;
2007-10-19 10:40:14 +04:00
	struct pid_namespace *ns;
2018-05-03 16:21:05 +03:00
	struct dentry *result = ERR_PTR(-ENOENT);
2005-04-17 02:20:36 +04:00
2014-08-09 01:21:25 +04:00
	tgid = name_to_int(&dentry->d_name);
2005-04-17 02:20:36 +04:00
	if (tgid == ~0U)
		goto out;
2020-04-19 17:10:52 +03:00
	fs_info = proc_sb_info(dentry->d_sb);
	ns = fs_info->pid_ns;
2006-06-26 11:25:51 +04:00
	rcu_read_lock();
2007-10-19 10:40:14 +04:00
	task = find_task_by_pid_ns(tgid, ns);
2005-04-17 02:20:36 +04:00
	if (task)
		get_task_struct(task);
2006-06-26 11:25:51 +04:00
	rcu_read_unlock();
2005-04-17 02:20:36 +04:00
	if (!task)
		goto out;
2020-04-19 17:10:53 +03:00
	/* Limit procfs to only ptraceable tasks */
	if (fs_info->hide_pid == HIDEPID_NOT_PTRACEABLE) {
		if (!has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS))
			goto out_put_task;
	}
2018-05-03 16:21:05 +03:00
	result = proc_pid_instantiate(dentry, task, NULL);
2020-04-19 17:10:53 +03:00
out_put_task:
2005-04-17 02:20:36 +04:00
	put_task_struct(task);
out:
2018-05-03 16:21:05 +03:00
	return result;
2005-04-17 02:20:36 +04:00
}
/*
[PATCH] proc: readdir race fix (take 3)
The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of system
calls. For this race to trigger the process that proc_pid_readdir stops at
must exit before readdir is called again.
This can cause ps to fail to report processes, and it is in violation of
posix guarantees and normal application expectations with respect to
readdir.
Currently there is no way to work around this problem in user space short
of providing a gargantuan buffer to user space so the directory read all
happens in one system call.
This patch implements the normal directory semantics for proc, that
guarantee that a directory entry that is neither created nor destroyed
while reading the directory entry will be returned. For directory entries that are
either created or destroyed during the readdir you may or may not see them.
Furthermore you may seek to a directory offset you have previously seen.
These are the guarantees that ext[23] provides and that posix requires, and
more importantly that user space expects. Plus it is a simple semantic to
implement reliable service. It is just a matter of calling readdir a
second time if you are wondering if something new has shown up.
These better semantics are implemented by scanning through the pids in
numerical order and by making the file offset a pid plus a fixed offset.
The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm. Given that a typical
cache line is 64 bytes and thus covers space for 64*8 == 512 pids. There
are only 64 cache lines for the entire 32K pid space. A typical system
will have 100 pids or more so this is actually fewer cache lines we have to
look at to scan a linked list, and the worst case of having to scan the
entire pid bitmap is pretty reasonable.
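
A minimal userspace sketch of the scheme described above, assuming one bit per pid, a 64-byte cache line and a 32K pid space. All names here (pid_bitmap, next_pid_ge, read_pids, DEMO_OFFSET) are hypothetical and only model the idea; the kernel code uses its own pid data structures and TGID_OFFSET.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define PID_SPACE	32768u			/* assumed 32K pid space */
#define CACHE_LINE	64u			/* assumed cache line size in bytes */
#define PIDS_PER_LINE	(CACHE_LINE * CHAR_BIT)	/* one bit per pid */
#define DEMO_OFFSET	256u			/* hypothetical fixed offset */

uint8_t pid_bitmap[PID_SPACE / CHAR_BIT];

void mark_pid(unsigned int pid)
{
	pid_bitmap[pid / CHAR_BIT] |= (uint8_t)(1u << (pid % CHAR_BIT));
}

void clear_pid(unsigned int pid)
{
	pid_bitmap[pid / CHAR_BIT] &= (uint8_t)~(1u << (pid % CHAR_BIT));
}

/* Smallest in-use pid >= start, or PID_SPACE if there is none. */
unsigned int next_pid_ge(unsigned int start)
{
	for (unsigned int pid = start; pid < PID_SPACE; pid++)
		if (pid_bitmap[pid / CHAR_BIT] & (1u << (pid % CHAR_BIT)))
			return pid;
	return PID_SPACE;
}

/*
 * Emit up to max pids in numerical order, starting at directory
 * position *pos (a pid plus DEMO_OFFSET).  *pos is advanced past the
 * last pid returned, so the next call resumes from a number rather
 * than from a task that may have exited in the meantime.
 */
size_t read_pids(unsigned int *pos, unsigned int *out, size_t max)
{
	unsigned int pid = *pos < DEMO_OFFSET ? 0 : *pos - DEMO_OFFSET;
	size_t n = 0;

	while (n < max && (pid = next_pid_ge(pid)) < PID_SPACE) {
		out[n++] = pid;
		pid++;
	}
	*pos = pid + DEMO_OFFSET;
	return n;
}
```

With these assumptions one cache line covers 64 * 8 bits and the whole bitmap spans 32768 / 512 cache lines. Because the saved position is just a number, a process that exits between two read_pids() calls cannot cause later pids to be skipped.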
If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.
In addition this takes no additional locks and is actually less code than
what we are doing now.
Also another very subtle bug in this area has been fixed. It is possible
to catch a task in the middle of de_thread where a thread is assuming the
thread of its thread group leader. This patch carefully handles that case
so if we hit it we don't fail to return the pid, that is undergoing the
de_thread dance.
Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
providing the first fix, pointing this out and working on it.
[oleg@tv-sign.ru: fix it]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 13:17:04 +04:00
 * Find the first task with tgid >= tgid
2006-06-26 11:25:50 +04:00
 *
2005-04-17 02:20:36 +04:00
 */
2007-11-29 03:21:26 +03:00
struct tgid_iter {
	unsigned int tgid;
2006-10-02 13:17:04 +04:00
	struct task_struct *task;
2007-11-29 03:21:26 +03:00
};
static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
{
2006-10-02 13:17:04 +04:00
	struct pid *pid;
2005-04-17 02:20:36 +04:00
2007-11-29 03:21:26 +03:00
	if (iter.task)
		put_task_struct(iter.task);
2006-06-26 11:25:51 +04:00
	rcu_read_lock();
2006-10-02 13:17:04 +04:00
retry:
2007-11-29 03:21:26 +03:00
	iter.task = NULL;
	pid = find_ge_pid(iter.tgid, ns);
2006-10-02 13:17:04 +04:00
	if (pid) {
2007-11-29 03:21:26 +03:00
		iter.tgid = pid_nr_ns(pid, ns);
2020-02-25 03:53:09 +03:00
		iter.task = pid_task(pid, PIDTYPE_TGID);
		if (!iter.task) {
2007-11-29 03:21:26 +03:00
			iter.tgid += 1;
2006-10-02 13:17:04 +04:00
			goto retry;
2007-11-29 03:21:26 +03:00
		}
		get_task_struct(iter.task);
2006-06-26 11:25:50 +04:00
	}
2006-06-26 11:25:51 +04:00
	rcu_read_unlock();
2007-11-29 03:21:26 +03:00
	return iter;
2005-04-17 02:20:36 +04:00
}
2014-07-31 14:10:50 +04:00
#define TGID_OFFSET (FIRST_PROCESS_ENTRY + 2)
2006-10-02 13:17:04 +04:00
2005-04-17 02:20:36 +04:00
/* for the /proc/ directory itself, after non-process stuff has been done */
2013-05-16 20:07:31 +04:00
int proc_pid_readdir(struct file *file, struct dir_context *ctx)
2005-04-17 02:20:36 +04:00
{
2007-11-29 03:21:26 +03:00
	struct tgid_iter iter;
proc: allow to mount many instances of proc in one pid namespace
This patch allows multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.
1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespaces we have to manage all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.
2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.
This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside those new procfs instances. This will
allow specifying per user session whether procfs should be populated with
pids that the user can ptrace or not.
By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptraceable. This change makes it possible to restrict
/proc/<pids>/, but more importantly it gives desktop users a
generic and usable way to specify which users should see all processes
and which users cannot.
Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...
In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.
Selftest has been added to verify new behavior.
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-19 17:10:52 +03:00
struct proc_fs_info * fs_info = proc_sb_info ( file_inode ( file ) -> i_sb ) ;
2020-05-18 21:07:38 +03:00
struct pid_namespace * ns = proc_pid_ns ( file_inode ( file ) -> i_sb ) ;
2013-05-16 20:07:31 +04:00
loff_t pos = ctx -> pos ;
2005-04-17 02:20:36 +04:00
2013-03-30 03:27:05 +04:00
if ( pos >= PID_MAX_LIMIT + TGID_OFFSET )
2013-05-16 20:07:31 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
2014-07-31 14:10:50 +04:00
if ( pos == TGID_OFFSET - 2 ) {
2020-04-19 17:10:52 +03:00
struct inode * inode = d_inode ( fs_info -> proc_self ) ;
2013-06-15 10:45:10 +04:00
if ( ! dir_emit ( ctx , "self" , 4 , inode -> i_ino , DT_LNK ) )
2013-05-16 20:07:31 +04:00
return 0 ;
2014-07-31 14:10:50 +04:00
ctx -> pos = pos = pos + 1 ;
}
if ( pos == TGID_OFFSET - 1 ) {
2020-04-19 17:10:52 +03:00
struct inode * inode = d_inode ( fs_info -> proc_thread_self ) ;
2014-07-31 14:10:50 +04:00
if ( ! dir_emit ( ctx , "thread-self" , 11 , inode -> i_ino , DT_LNK ) )
return 0 ;
ctx -> pos = pos = pos + 1 ;
2013-03-30 03:27:05 +04:00
}
2014-07-31 14:10:50 +04:00
iter . tgid = pos - TGID_OFFSET ;
2007-11-29 03:21:26 +03:00
iter . task = NULL ;
for ( iter = next_tgid ( ns , iter ) ;
iter . task ;
iter . tgid += 1 , iter = next_tgid ( ns , iter ) ) {
2018-02-07 02:36:51 +03:00
char name [ 10 + 1 ] ;
2018-06-08 03:10:10 +03:00
unsigned int len ;
2017-01-25 02:18:07 +03:00
cond_resched ( ) ;
2020-04-19 17:10:52 +03:00
if ( ! has_pid_permissions ( fs_info , iter . task , HIDEPID_INVISIBLE ) )
2013-05-16 20:07:31 +04:00
continue ;
procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.
The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:
hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.
hidepid=1 means users may not access any /proc/<pid>/ directories, but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking is done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.
hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides process' euid
and egid. It complicates an intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.
gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting
nonroot user in sudoers file or something. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystrokes
timings:
http://www.openwall.com/lists/oss-security/2011/11/05/3
hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on that the leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user who started the setuid program may read the
counters.
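The hidepid levels and the gid= exemption described above can be modelled in a few lines of userspace C. This is a sketch under loud assumptions, not the kernel's permission code: struct viewer, may_inspect(), pid_allowed(), can_open_pid_dir() and is_listed() are hypothetical names, and "may inspect another task" is reduced to "is root or has the same uid", which is much cruder than the real ptrace-based check.

```c
#include <stdbool.h>

/* Levels as numbered in the description above. */
enum hidepid_mode {
	HIDEPID_OFF = 0,        /* old behaviour */
	HIDEPID_NO_ACCESS = 1,  /* other users' /proc/<pid>/ cannot be accessed */
	HIDEPID_INVISIBLE = 2   /* ...and is not listed either */
};

struct viewer {
	unsigned int uid;
	bool in_proc_gid;       /* member of the group named by gid= */
};

/* Simplification (assumption): root or the owner may inspect a task. */
static bool may_inspect(const struct viewer *v, unsigned int owner_uid)
{
	return v->uid == 0 || v->uid == owner_uid;
}

/* A restriction only applies once the mounted level reaches min_mode. */
static bool pid_allowed(enum hidepid_mode mounted, enum hidepid_mode min_mode,
			const struct viewer *v, unsigned int owner_uid)
{
	if (mounted < min_mode)
		return true;
	if (v->in_proc_gid)
		return true;
	return may_inspect(v, owner_uid);
}

static bool can_open_pid_dir(enum hidepid_mode mounted,
			     const struct viewer *v, unsigned int owner_uid)
{
	return pid_allowed(mounted, HIDEPID_NO_ACCESS, v, owner_uid);
}

static bool is_listed(enum hidepid_mode mounted,
		      const struct viewer *v, unsigned int owner_uid)
{
	return pid_allowed(mounted, HIDEPID_INVISIBLE, v, owner_uid);
}
```

Under this model, hidepid=1 still lists another user's pid but refuses access to its directory, while hidepid=2 also drops it from the listing unless the viewer is in the gid= group.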
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:31 +04:00
2018-02-07 02:36:51 +03:00
len = snprintf ( name , sizeof ( name ) , "%u" , iter . tgid ) ;
2013-05-16 20:07:31 +04:00
ctx -> pos = iter . tgid + TGID_OFFSET ;
if ( ! proc_fill_cache ( file , ctx , name , len ,
proc_pid_instantiate , iter . task , NULL ) ) {
2007-11-29 03:21:26 +03:00
put_task_struct ( iter . task ) ;
2013-05-16 20:07:31 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2006-06-26 11:25:50 +04:00
}
2013-05-16 20:07:31 +04:00
ctx -> pos = PID_MAX_LIMIT + TGID_OFFSET ;
2006-06-26 11:25:50 +04:00
return 0 ;
}
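The way proc_pid_readdir() above maps a directory offset to "self", "thread-self" or a tgid to resume scanning from can be sketched as a small decoder. This is an illustration only: SKETCH_TGID_OFFSET is a made-up value standing in for the kernel's TGID_OFFSET, and decode_pos() and encode_tgid() are hypothetical helpers that do not exist in the kernel.

```c
/* Hypothetical value for illustration; the kernel defines its own TGID_OFFSET. */
#define SKETCH_TGID_OFFSET 258

enum entry_kind { ENTRY_OTHER, ENTRY_SELF, ENTRY_THREAD_SELF, ENTRY_TGID };

struct decoded_pos {
	enum entry_kind kind;
	long long scan_from;    /* valid only when kind == ENTRY_TGID */
};

/* Classify a directory offset the same way the readdir loop does. */
static struct decoded_pos decode_pos(long long pos)
{
	struct decoded_pos d = { ENTRY_OTHER, -1 };

	if (pos == SKETCH_TGID_OFFSET - 2) {
		d.kind = ENTRY_SELF;
	} else if (pos == SKETCH_TGID_OFFSET - 1) {
		d.kind = ENTRY_THREAD_SELF;
	} else if (pos >= SKETCH_TGID_OFFSET) {
		/* resume the scan at the first live tgid >= scan_from */
		d.kind = ENTRY_TGID;
		d.scan_from = pos - SKETCH_TGID_OFFSET;
	}
	return d;
}

/* Offset stored in ctx->pos while the entry for tgid is being emitted. */
static long long encode_tgid(long long tgid)
{
	return tgid + SKETCH_TGID_OFFSET;
}
```

Because the offset is derived from the tgid itself rather than from a list index, a reader that stops and restarts resumes at the same pid even if other processes exited in between.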
2005-04-17 02:20:36 +04:00
2016-05-21 03:00:08 +03:00
/*
 * proc_tid_comm_permission is a special permission function exclusively
 * used for the node /proc/<pid>/task/<tid>/comm.
 * It bypasses generic permission checks in the case where a task of the same
 * task group attempts to access the node.
 * The rationale behind this is that glibc and bionic access this node for
 * cross thread naming (pthread_set/getname_np(!self)). However, if
 * PR_SET_DUMPABLE gets set to 0 this node among others becomes uid=0 gid=0,
 * which locks out the cross thread naming implementation.
 * This function makes sure that the node is always accessible for members of
 * same thread group.
 */
2023-01-13 14:49:22 +03:00
static int proc_tid_comm_permission ( struct mnt_idmap * idmap ,
2021-01-21 16:19:43 +03:00
struct inode * inode , int mask )
2016-05-21 03:00:08 +03:00
{
bool is_same_tgroup ;
struct task_struct * task ;
task = get_proc_task ( inode ) ;
if ( ! task )
return - ESRCH ;
is_same_tgroup = same_thread_group ( current , task ) ;
put_task_struct ( task ) ;
if ( likely ( is_same_tgroup && ! ( mask & MAY_EXEC ) ) ) {
/* This file (/proc/<pid>/task/<tid>/comm) can always be
* read or written by the members of the corresponding
* thread group .
*/
return 0 ;
}
2023-01-13 14:49:22 +03:00
return generic_permission ( & nop_mnt_idmap , inode , mask ) ;
2016-05-21 03:00:08 +03:00
}
static const struct inode_operations proc_tid_comm_inode_operations = {
. permission = proc_tid_comm_permission ,
} ;
2006-10-02 13:17:05 +04:00
/*
* Tasks
*/
2007-05-08 11:26:15 +04:00
static const struct pid_entry tid_base_stuff [ ] = {
2008-11-10 01:32:52 +03:00
DIR ( " fd " , S_IRUSR | S_IXUSR , proc_fd_inode_operations , proc_fd_operations ) ,
2021-07-01 04:54:44 +03:00
DIR ( " fdinfo " , S_IRUGO | S_IXUGO , proc_fdinfo_inode_operations , proc_fdinfo_operations ) ,
2010-03-08 03:41:34 +03:00
DIR ( " ns " , S_IRUSR | S_IXUGO , proc_ns_dir_inode_operations , proc_ns_dir_operations ) ,
2014-08-01 03:27:08 +04:00
# ifdef CONFIG_NET
DIR ( " net " , S_IRUGO | S_IXUGO , proc_net_inode_operations , proc_net_operations ) ,
# endif
2008-11-10 01:32:52 +03:00
REG ( " environ " , S_IRUSR , proc_environ_operations ) ,
2016-10-06 01:43:43 +03:00
REG ( " auxv " , S_IRUSR , proc_auxv_operations ) ,
2008-11-10 01:32:52 +03:00
ONE ( " status " , S_IRUGO , proc_pid_status ) ,
2014-04-08 02:38:36 +04:00
ONE ( " personality " , S_IRUSR , proc_pid_personality ) ,
2014-08-09 01:21:37 +04:00
ONE ( " limits " , S_IRUGO , proc_pid_limits ) ,
2007-07-09 20:52:00 +04:00
# ifdef CONFIG_SCHED_DEBUG
2008-11-10 01:32:52 +03:00
REG ( " sched " , S_IRUGO | S_IWUSR , proc_pid_sched_operations ) ,
2008-07-26 06:46:00 +04:00
# endif
2016-05-21 03:00:08 +03:00
NOD ( " comm " , S_IFREG | S_IRUGO | S_IWUSR ,
& proc_tid_comm_inode_operations ,
& proc_pid_set_comm_operations , { } ) ,
2008-07-26 06:46:00 +04:00
# ifdef CONFIG_HAVE_ARCH_TRACEHOOK
2014-08-09 01:21:39 +04:00
ONE ( " syscall " , S_IRUSR , proc_pid_syscall ) ,
2007-07-09 20:52:00 +04:00
# endif
2015-06-26 01:00:54 +03:00
REG ( " cmdline " , S_IRUGO , proc_pid_cmdline_ops ) ,
2008-11-10 01:32:52 +03:00
ONE ( " stat " , S_IRUGO , proc_tid_stat ) ,
ONE ( " statm " , S_IRUGO , proc_pid_statm ) ,
2018-08-22 07:52:48 +03:00
REG ( " maps " , S_IRUGO , proc_pid_maps_operations ) ,
2015-06-26 01:00:57 +03:00
# ifdef CONFIG_PROC_CHILDREN
2012-06-01 03:26:43 +04:00
REG ( " children " , S_IRUGO , proc_tid_children_operations ) ,
# endif
2006-10-02 13:17:05 +04:00
# ifdef CONFIG_NUMA
2018-08-22 07:52:48 +03:00
REG ( " numa_maps " , S_IRUGO , proc_pid_numa_maps_operations ) ,
2006-10-02 13:17:05 +04:00
# endif
2008-11-10 01:32:52 +03:00
REG ( " mem " , S_IRUSR | S_IWUSR , proc_mem_operations ) ,
LNK ( " cwd " , proc_cwd_link ) ,
LNK ( " root " , proc_root_link ) ,
LNK ( " exe " , proc_exe_link ) ,
REG ( " mounts " , S_IRUGO , proc_mounts_operations ) ,
REG ( " mountinfo " , S_IRUGO , proc_mountinfo_operations ) ,
2008-02-05 09:29:07 +03:00
# ifdef CONFIG_PROC_PAGE_MONITOR
2008-11-10 01:32:52 +03:00
REG ( " clear_refs " , S_IWUSR , proc_clear_refs_operations ) ,
2018-08-22 07:52:48 +03:00
REG ( " smaps " , S_IRUGO , proc_pid_smaps_operations ) ,
mm: add /proc/pid/smaps_rollup
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.
Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes. This sampling process involves
opening /proc/pid/smaps and summing certain fields. For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.
smaps_rollup improves the situation. It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process. In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all of
the real-smaps VMAs. Using a common format for smaps_rollup and smaps
allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.
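Because smaps and smaps_rollup share one line format, a single userspace parser can total a field from either file. The sketch below assumes the caller has already read the file into a NUL-terminated buffer; sum_field_kb() is a hypothetical helper written for this note, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum every "<field> <n> kB" line in smaps-format text.
 * field includes the trailing colon, e.g. "Pss:", and must match at the
 * start of a line, so "SwapPss:" is not counted as "Pss:".
 * Returns the total in kB.
 */
static long long sum_field_kb(const char *text, const char *field)
{
	size_t flen = strlen(field);
	long long total = 0;
	const char *line = text;

	while (*line) {
		long long kb;
		const char *eol = strchr(line, '\n');

		if (strncmp(line, field, flen) == 0 &&
		    sscanf(line + flen, "%lld kB", &kb) == 1)
			total += kb;
		if (!eol)
			break;
		line = eol + 1;
	}
	return total;
}
```

On smaps the loop adds one value per VMA; on smaps_rollup it sees a single synthetic entry, so the same call returns the already-summed figure after parsing far fewer lines.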
By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings. For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.
One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status. We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.
The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA. In this way, summation happens automatically. We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.
Benchmarks:
using smaps:
iterations:1000 pid:1163 pss:220023808
0m29.46s real 0m08.28s user 0m20.98s system
using smaps_rollup:
iterations:1000 pid:1163 pss:220702720
0m04.39s real 0m00.03s user 0m04.31s system
We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows. Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and
devices have become more efficiency-conscious, PSS-collection inefficiency
has started to matter more. IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.
Tim said:
: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
: application via the settings app. This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications. We have found
: an enormous number of bugs (in the Android platform, in Google's own
: apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that. The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend). PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it. We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent. Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process. The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release. After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process. Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward. I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).
Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07 02:25:08 +03:00
REG ( " smaps_rollup " , S_IRUGO , proc_pid_smaps_rollup_operations ) ,
2014-04-08 02:38:38 +04:00
REG ( " pagemap " , S_IRUSR , proc_pagemap_operations ) ,
2006-10-02 13:17:05 +04:00
# endif
# ifdef CONFIG_SECURITY
2008-11-10 01:32:52 +03:00
DIR ( " attr " , S_IRUGO | S_IXUGO , proc_attr_dir_inode_operations , proc_attr_dir_operations ) ,
2006-10-02 13:17:05 +04:00
# endif
# ifdef CONFIG_KALLSYMS
2014-08-09 01:21:44 +04:00
ONE ( " wchan " , S_IRUGO , proc_pid_wchan ) ,
2006-10-02 13:17:05 +04:00
# endif
2008-11-10 11:26:08 +03:00
# ifdef CONFIG_STACKTRACE
2014-04-08 02:38:36 +04:00
ONE ( " stack " , S_IRUSR , proc_pid_stack ) ,
2006-10-02 13:17:05 +04:00
# endif
2015-06-30 12:06:03 +03:00
# ifdef CONFIG_SCHED_INFO
2014-08-09 01:21:46 +04:00
ONE ( " schedstat " , S_IRUGO , proc_pid_schedstat ) ,
2006-10-02 13:17:05 +04:00
# endif
2008-01-25 23:08:34 +03:00
# ifdef CONFIG_LATENCYTOP
2008-11-10 01:32:52 +03:00
REG ( " latency " , S_IRUGO , proc_lstats_operations ) ,
2008-01-25 23:08:34 +03:00
# endif
2007-10-19 10:39:39 +04:00
# ifdef CONFIG_PROC_PID_CPUSET
2014-09-18 12:03:36 +04:00
ONE ( " cpuset " , S_IRUGO , proc_cpuset_show ) ,
2007-10-19 10:39:35 +04:00
# endif
# ifdef CONFIG_CGROUPS
2014-09-18 12:03:15 +04:00
ONE ( " cgroup " , S_IRUGO , proc_cgroup_show ) ,
2020-01-15 12:28:51 +03:00
# endif
# ifdef CONFIG_PROC_CPU_RESCTRL
ONE ( " cpu_resctrl_groups " , S_IRUGO , proc_resctrl_show ) ,
2006-10-02 13:17:05 +04:00
# endif
2014-08-09 01:21:48 +04:00
ONE ( " oom_score " , S_IRUGO , proc_oom_score ) ,
2012-11-13 05:53:04 +04:00
REG ( " oom_adj " , S_IRUGO | S_IWUSR , proc_oom_adj_operations ) ,
oom: badness heuristic rewrite
This is a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
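As a rough illustration of the scoring described above, the following sketch computes a score on the 0 to 1000 scale from a task's usage and its allowable memory, then applies an oom_score_adj value. The function name badness_sketch, its parameters, the clamping to the range 0..1000, and the modeling of root's 3% bonus as a flat 30-point discount are all assumptions made for illustration, not a quote of the kernel source.

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the proportional badness heuristic.
 * usage_pages:     RAM plus swap used by the task, in pages
 * allowable_pages: memory the task may use (system-wide, cpuset,
 *                  mempolicy or memory controller limit), in pages
 * oom_score_adj:   user tunable in the range -1000..+1000
 */
static long badness_sketch(unsigned long usage_pages,
                           unsigned long allowable_pages,
                           bool is_root, int oom_score_adj)
{
    long points;

    if (allowable_pages == 0)
        return 0;   /* assumption: nothing allowable, nothing to score */

    /* Proportion of allowable memory in use, scaled to 0..1000. */
    points = (long)(usage_pages * 1000 / allowable_pages);

    /* Assumption: model root's 3% extra memory as a 30-point discount. */
    if (is_root)
        points -= 30;

    /* The tunable is added directly to the score. */
    points += oom_score_adj;

    /* Assumption: clamp to the documented 0..1000 scale. */
    if (points < 0)
        return 0;
    return points > 1000 ? 1000 : points;
}
```

Under these assumptions, a task using half of its allowable memory scores 500, and an oom_score_adj of -500 brings that score down to 0.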
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10 04:19:46 +04:00
REG ( " oom_score_adj " , S_IRUGO | S_IWUSR , proc_oom_score_adj_operations ) ,
2019-01-23 01:06:39 +03:00
# ifdef CONFIG_AUDIT
2008-11-10 01:32:52 +03:00
REG ( " loginuid " , S_IWUSR | S_IRUGO , proc_loginuid_operations ) ,
2011-02-16 05:24:05 +03:00
REG ( " sessionid " , S_IRUGO , proc_sessionid_operations ) ,
2006-10-02 13:17:05 +04:00
# endif
2006-12-08 13:39:47 +03:00
# ifdef CONFIG_FAULT_INJECTION
2008-11-10 01:32:52 +03:00
REG ( " make-it-fail " , S_IRUGO | S_IWUSR , proc_fault_inject_operations ) ,
2017-07-15 00:49:57 +03:00
REG ( " fail-nth " , 0644 , proc_fail_nth_operations ) ,
2006-12-08 13:39:47 +03:00
# endif
2008-07-25 12:48:49 +04:00
# ifdef CONFIG_TASK_IO_ACCOUNTING
2014-08-09 01:21:50 +04:00
ONE ( " io " , S_IRUSR , proc_tid_io_accounting ) ,
2008-07-25 12:48:49 +04:00
# endif
2011-11-17 12:11:58 +04:00
# ifdef CONFIG_USER_NS
REG ( " uid_map " , S_IRUGO | S_IWUSR , proc_uid_map_operations ) ,
REG ( " gid_map " , S_IRUGO | S_IWUSR , proc_gid_map_operations ) ,
2012-08-30 12:24:05 +04:00
REG ( " projid_map " , S_IRUGO | S_IWUSR , proc_projid_map_operations ) ,
2014-12-02 21:27:26 +03:00
REG ( " setgroups " , S_IRUGO | S_IWUSR , proc_setgroups_operations ) ,
2011-11-17 12:11:58 +04:00
# endif
2017-02-14 04:42:41 +03:00
# ifdef CONFIG_LIVEPATCH
ONE ( " patch_state " , S_IRUSR , proc_pid_patch_state ) ,
# endif
2019-06-06 04:22:34 +03:00
# ifdef CONFIG_PROC_PID_ARCH_STATUS
ONE ( " arch_status " , S_IRUGO , proc_pid_arch_status ) ,
# endif
2020-11-11 16:33:54 +03:00
# ifdef CONFIG_SECCOMP_CACHE_DEBUG
ONE ( " seccomp_cache " , S_IRUSR , proc_pid_seccomp_cache ) ,
# endif
2022-04-29 09:16:16 +03:00
# ifdef CONFIG_KSM
ONE ( " ksm_merging_pages " , S_IRUSR , proc_pid_ksm_merging_pages ) ,
ksm: count allocated ksm rmap_items for each process
Patch series "ksm: count allocated rmap_items and update documentation",
v5.
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.
To help users determine how beneficial the ksm-policy (like madvise) they
are using is, we add a new interface /proc/<pid>/ksm_stat for each process.
The value "ksm_rmap_items" in it indicates the total allocated ksm
rmap_items of this process.
The detailed description can be seen in the following patches' commit
message.
This patch (of 2):
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information. Some of these pages may be merged,
but some may not be able to be merged even after being checked several
times, in which case the memory consumed is unprofitable.
Whether KSM saves or consumes memory system-wide can be determined by a
comprehensive calculation over pages_sharing, pages_shared, pages_unshared
and pages_volatile. A simple approximate calculation:
profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items equals the sum of pages_sharing, pages_shared,
pages_unshared and pages_volatile.
But we cannot calculate this kind of ksm profit for a single process
because the number of ksm rmap_items of a process is not available.
For user applications, if this kind of information could be obtained, it
helps users know how beneficial the ksm-policy (like madvise) they are
using is, and then optimize their app code. For example, if an
application madvises 1000 pages as MERGEABLE while only a few pages are
really merged, then it's not cost-efficient.
So we add a new interface /proc/<pid>/ksm_stat for each process, in which
only the value of ksm_rmap_items is shown for now; more values can be
added in future.
So similarly, we can calculate the ksm profit approximately for a single
process by:
profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
sizeof(rmap_item);
where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
ksm_rmap_items is shown in /proc/<pid>/ksm_stat.
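The per-process formula above can be sketched as a small helper. The name ksm_profit_bytes and the idea of passing the rmap_item size as a parameter are assumptions for illustration: sizeof(rmap_item) is internal to the kernel, so a caller would have to supply an estimate.

```c
/*
 * Hypothetical helper: approximate per-process KSM profit in bytes.
 * ksm_merging_pages: value read from /proc/<pid>/ksm_merging_pages
 * ksm_rmap_items:    value read from /proc/<pid>/ksm_stat
 * rmap_item_size:    caller-supplied estimate of sizeof(rmap_item);
 *                    the real size is internal to the kernel
 */
static long long ksm_profit_bytes(unsigned long long ksm_merging_pages,
                                  unsigned long long ksm_rmap_items,
                                  unsigned long long page_size,
                                  unsigned long long rmap_item_size)
{
    return (long long)(ksm_merging_pages * page_size)
         - (long long)(ksm_rmap_items * rmap_item_size);
}
```

A negative result suggests that, under the assumed sizes, the process spends more memory on rmap_items than it saves through merging.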
Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-30 17:38:38 +03:00
ONE ( " ksm_stat " , S_IRUSR , proc_pid_ksm_stat ) ,
2022-04-29 09:16:16 +03:00
# endif
2006-10-02 13:17:05 +04:00
} ;
2013-05-16 20:07:31 +04:00
static int proc_tid_base_readdir ( struct file * file , struct dir_context * ctx )
2006-10-02 13:17:05 +04:00
{
2013-05-16 20:07:31 +04:00
return proc_pident_readdir ( file , ctx ,
tid_base_stuff , ARRAY_SIZE ( tid_base_stuff ) ) ;
2006-10-02 13:17:05 +04:00
}
2012-06-11 01:13:09 +04:00
static struct dentry * proc_tid_base_lookup ( struct inode * dir , struct dentry * dentry , unsigned int flags )
{
2006-10-02 13:18:56 +04:00
return proc_pident_lookup ( dir , dentry ,
2019-03-12 09:28:51 +03:00
tid_base_stuff ,
tid_base_stuff + ARRAY_SIZE ( tid_base_stuff ) ) ;
2006-10-02 13:17:05 +04:00
}
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_tid_base_operations = {
2006-10-02 13:17:05 +04:00
. read = generic_read_dir ,
2016-04-21 00:13:54 +03:00
. iterate_shared = proc_tid_base_readdir ,
. llseek = generic_file_llseek ,
2006-10-02 13:17:05 +04:00
} ;
2007-02-12 11:55:40 +03:00
static const struct inode_operations proc_tid_base_inode_operations = {
2006-10-02 13:17:05 +04:00
. lookup = proc_tid_base_lookup ,
. getattr = pid_getattr ,
. setattr = proc_setattr ,
} ;
2018-05-03 16:21:05 +03:00
static struct dentry * proc_task_instantiate ( struct dentry * dentry ,
struct task_struct * task , const void * ptr )
2006-10-02 13:18:49 +04:00
{
struct inode * inode ;
2022-07-13 16:00:29 +03:00
inode = proc_pid_make_base_inode ( dentry - > d_sb , task ,
S_IFDIR | S_IRUGO | S_IXUGO ) ;
2006-10-02 13:18:49 +04:00
if ( ! inode )
2018-05-03 16:21:05 +03:00
return ERR_PTR ( - ENOENT ) ;
2018-05-03 04:26:16 +03:00
2006-10-02 13:18:49 +04:00
inode - > i_op = & proc_tid_base_inode_operations ;
inode - > i_fop = & proc_tid_base_operations ;
2018-05-03 04:26:16 +03:00
inode - > i_flags | = S_IMMUTABLE ;
2008-06-06 09:46:53 +04:00
2016-12-13 03:45:32 +03:00
set_nlink ( inode , nlink_tid ) ;
2018-05-03 04:26:16 +03:00
pid_update_inode ( task , inode ) ;
2006-10-02 13:18:49 +04:00
2011-01-07 09:49:55 +03:00
d_set_d_op ( dentry , & pid_dentry_operations ) ;
2018-05-03 16:21:05 +03:00
return d_splice_alias ( inode , dentry ) ;
2006-10-02 13:18:49 +04:00
}
2012-06-11 01:13:09 +04:00
static struct dentry * proc_task_lookup ( struct inode * dir , struct dentry * dentry , unsigned int flags )
2006-10-02 13:17:05 +04:00
{
struct task_struct * task ;
struct task_struct * leader = get_proc_task ( dir ) ;
unsigned tid ;
proc: allow to mount many instances of proc in one pid namespace
This patch allows multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.
1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespaces we have to manage all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.
2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.
This also makes it hard to support the Yama LSM in desktop and user
sessions. Yama ptrace scope, which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside the new procfs instances. This will
allow user sessions to specify whether procfs should be populated with
pids that the user can ptrace or not.
By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptraceable. This change makes it possible to restrict
/proc/<pids>/ but more importantly it gives desktop users a
generic and usable way to specify which users should see all processes
and which users cannot.
Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...
In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.
Selftest has been added to verify new behavior.
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-19 17:10:52 +03:00
struct proc_fs_info * fs_info ;
2007-10-19 10:40:14 +04:00
struct pid_namespace * ns ;
2018-05-03 16:21:05 +03:00
struct dentry * result = ERR_PTR ( - ENOENT ) ;
2006-10-02 13:17:05 +04:00
if ( ! leader )
goto out_no_task ;
2014-08-09 01:21:25 +04:00
tid = name_to_int ( & dentry - > d_name ) ;
2006-10-02 13:17:05 +04:00
if ( tid = = ~ 0U )
goto out ;
2020-04-19 17:10:52 +03:00
fs_info = proc_sb_info ( dentry - > d_sb ) ;
ns = fs_info - > pid_ns ;
2006-10-02 13:17:05 +04:00
rcu_read_lock ( ) ;
2007-10-19 10:40:14 +04:00
task = find_task_by_pid_ns ( tid , ns ) ;
2006-10-02 13:17:05 +04:00
if ( task )
get_task_struct ( task ) ;
rcu_read_unlock ( ) ;
if ( ! task )
goto out ;
2007-10-19 10:40:18 +04:00
if ( ! same_thread_group ( leader , task ) )
2006-10-02 13:17:05 +04:00
goto out_drop_task ;
2018-05-03 16:21:05 +03:00
result = proc_task_instantiate ( dentry , task , NULL ) ;
2006-10-02 13:17:05 +04:00
out_drop_task :
put_task_struct ( task ) ;
out :
put_task_struct ( leader ) ;
out_no_task :
2018-05-03 16:21:05 +03:00
return result ;
2006-10-02 13:17:05 +04:00
}
2006-06-26 11:25:50 +04:00
/*
* Find the first tid of a thread group to return to user space .
*
* Usually this is just the thread group leader , but if the user's
* buffer was too small or there was a seek into the middle of the
* directory we have more work to do .
*
* In the case of a short read we start with find_task_by_pid .
*
* In the case of a seek we start with the leader and walk nr
* threads past it .
*/
2014-01-24 03:55:40 +04:00
static struct task_struct * first_tid ( struct pid * pid , int tid , loff_t f_pos ,
struct pid_namespace * ns )
2006-06-26 11:25:50 +04:00
{
2014-01-24 03:55:39 +04:00
struct task_struct * pos , * task ;
2014-01-24 03:55:40 +04:00
unsigned long nr = f_pos ;
if ( nr ! = f_pos ) /* 32bit overflow? */
return NULL ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:26:01 +04:00
rcu_read_lock ( ) ;
2014-01-24 03:55:39 +04:00
task = pid_task ( pid , PIDTYPE_PID ) ;
if ( ! task )
goto fail ;
/* Attempt to start with the tid of a thread */
2014-01-24 03:55:40 +04:00
if ( tid & & nr ) {
2007-10-19 10:40:14 +04:00
pos = find_task_by_pid_ns ( tid , ns ) ;
2014-01-24 03:55:39 +04:00
if ( pos & & same_thread_group ( pos , task ) )
2006-06-26 11:26:01 +04:00
goto found ;
2006-06-26 11:25:50 +04:00
}
2005-04-17 02:20:36 +04:00
2006-06-26 11:25:50 +04:00
/* If nr exceeds the number of threads there is nothing to do */
2014-01-24 03:55:40 +04:00
if ( nr > = get_nr_threads ( task ) )
2014-01-24 03:55:38 +04:00
goto fail ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:26:01 +04:00
/* If we haven't found our starting place yet start
* with the leader and walk nr threads forward .
2006-06-26 11:25:50 +04:00
*/
2014-01-24 03:55:39 +04:00
pos = task = task - > group_leader ;
2014-01-24 03:55:38 +04:00
do {
2014-01-24 03:55:40 +04:00
if ( ! nr - - )
2014-01-24 03:55:38 +04:00
goto found ;
2014-01-24 03:55:39 +04:00
} while_each_thread ( task , pos ) ;
2014-01-24 03:55:38 +04:00
fail :
pos = NULL ;
goto out ;
2006-06-26 11:26:01 +04:00
found :
get_task_struct ( pos ) ;
out :
2006-06-26 11:26:01 +04:00
rcu_read_unlock ( ) ;
2006-06-26 11:25:50 +04:00
return pos ;
}
/*
* Find the next thread in the thread list .
* Return NULL if there is an error or no next thread .
*
* The reference to the input task_struct is released .
*/
static struct task_struct * next_tid ( struct task_struct * start )
{
2006-06-26 11:26:02 +04:00
struct task_struct * pos = NULL ;
2006-06-26 11:26:01 +04:00
rcu_read_lock ( ) ;
2006-06-26 11:26:02 +04:00
if ( pid_alive ( start ) ) {
2006-06-26 11:25:50 +04:00
pos = next_thread ( start ) ;
2006-06-26 11:26:02 +04:00
if ( thread_group_leader ( pos ) )
pos = NULL ;
else
get_task_struct ( pos ) ;
}
2006-06-26 11:26:01 +04:00
rcu_read_unlock ( ) ;
2006-06-26 11:25:50 +04:00
put_task_struct ( start ) ;
return pos ;
2005-04-17 02:20:36 +04:00
}
/* for the /proc/TGID/task/ directories */
2013-05-16 20:07:31 +04:00
static int proc_task_readdir ( struct file * file , struct dir_context * ctx )
2005-04-17 02:20:36 +04:00
{
2014-01-24 03:55:39 +04:00
struct inode * inode = file_inode ( file ) ;
struct task_struct * task ;
2007-10-19 10:40:14 +04:00
struct pid_namespace * ns ;
2013-05-16 20:07:31 +04:00
int tid ;
2005-04-17 02:20:36 +04:00
2014-01-24 03:55:39 +04:00
if ( proc_inode_is_dead ( inode ) )
2013-05-16 20:07:31 +04:00
return - ENOENT ;
2005-04-17 02:20:36 +04:00
2013-05-16 20:07:31 +04:00
if ( ! dir_emit_dots ( file , ctx ) )
2014-01-24 03:55:39 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
2006-06-26 11:25:50 +04:00
/* f_version caches the tgid value that the last readdir call couldn't
* return . lseek aka telldir automagically resets f_version to 0.
*/
2020-05-18 21:07:38 +03:00
ns = proc_pid_ns ( inode - > i_sb ) ;
2013-05-16 20:07:31 +04:00
tid = ( int ) file - > f_version ;
file - > f_version = 0 ;
2014-01-24 03:55:39 +04:00
for ( task = first_tid ( proc_pid ( inode ) , tid , ctx - > pos - 2 , ns ) ;
2006-06-26 11:25:50 +04:00
task ;
2013-05-16 20:07:31 +04:00
task = next_tid ( task ) , ctx - > pos + + ) {
2018-02-07 02:36:51 +03:00
char name [ 10 + 1 ] ;
2018-06-08 03:10:10 +03:00
unsigned int len ;
2021-11-09 05:31:30 +03:00
2007-10-19 10:40:14 +04:00
tid = task_pid_nr_ns ( task , ns ) ;
2021-11-09 05:31:30 +03:00
if ( ! tid )
continue ; /* The task has just exited. */
2018-02-07 02:36:51 +03:00
len = snprintf ( name , sizeof ( name ) , " %u " , tid ) ;
2013-05-16 20:07:31 +04:00
if ( ! proc_fill_cache ( file , ctx , name , len ,
proc_task_instantiate , task , NULL ) ) {
2006-06-26 11:25:50 +04:00
/* returning this tgid failed, save it as the first
* pid for the next readdir call */
2013-05-16 20:07:31 +04:00
file - > f_version = ( u64 ) tid ;
2006-06-26 11:25:50 +04:00
put_task_struct ( task ) ;
2005-04-17 02:20:36 +04:00
break ;
2006-06-26 11:25:50 +04:00
}
2005-04-17 02:20:36 +04:00
}
2014-01-24 03:55:39 +04:00
2013-05-16 20:07:31 +04:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
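From userspace, the directory served by proc_task_readdir() is read with the ordinary directory API. The following is a minimal sketch, assuming a mounted procfs; the helper name count_task_entries is hypothetical. It skips the dot entries and counts the names that parse as decimal thread IDs.

```c
#include <dirent.h>
#include <stdlib.h>

/*
 * Count the thread IDs listed in a task directory such as
 * "/proc/self/task". Returns -1 if the directory cannot be opened.
 */
static int count_task_entries(const char *task_dir)
{
    DIR *dir = opendir(task_dir);
    struct dirent *ent;
    int count = 0;

    if (!dir)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char *end;

        /* Skip "." and ".."; every other name should be a decimal tid. */
        if (ent->d_name[0] == '.')
            continue;
        (void)strtoul(ent->d_name, &end, 10);
        if (*end == '\0')
            count++;
    }
    closedir(dir);
    return count;
}
```

For a live process the count should be at least one, since the thread group leader appears in its own task directory.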
2006-06-26 11:25:47 +04:00
2023-01-13 14:49:12 +03:00
static int proc_task_getattr ( struct mnt_idmap * idmap ,
2021-01-21 16:19:43 +03:00
const struct path * path , struct kstat * stat ,
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 19:46:22 +03:00
u32 request_mask , unsigned int query_flags )
2006-06-26 11:25:47 +04:00
{
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
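The size requirement on the buffer can be checked at compile time. The sketch below assumes the struct statx definition shipped in the installed kernel headers (<linux/stat.h>), which may differ in detail from the layout quoted in this commit message; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/stat.h>   /* struct statx, struct statx_timestamp */

/* The text above states that the buffer must be 256 bytes in size. */
static_assert(sizeof(struct statx) == 256,
	      "struct statx is expected to be 256 bytes");

/* __s64 seconds + __s32 nanoseconds + __s32 reserved = 16 bytes. */
static_assert(sizeof(struct statx_timestamp) == 16,
	      "struct statx_timestamp is expected to be 16 bytes");

/* Hypothetical helper so callers can size buffers without magic numbers. */
static size_t statx_buffer_size(void)
{
	return sizeof(struct statx);
}
```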
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
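To make the mask bits concrete, here is a small sketch that decodes a mask value such as the "results=7ff" printed by the test program. It assumes the STATX_* constants from the installed <linux/stat.h>; the table and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <linux/stat.h>   /* STATX_* mask bits */

/* Hypothetical lookup table pairing each mask bit with its name. */
static const struct {
	unsigned int bit;
	const char *name;
} statx_bits[] = {
	{ STATX_TYPE,   "TYPE"   }, { STATX_MODE,  "MODE"  },
	{ STATX_NLINK,  "NLINK"  }, { STATX_UID,   "UID"   },
	{ STATX_GID,    "GID"    }, { STATX_ATIME, "ATIME" },
	{ STATX_MTIME,  "MTIME"  }, { STATX_CTIME, "CTIME" },
	{ STATX_INO,    "INO"    }, { STATX_SIZE,  "SIZE"  },
	{ STATX_BLOCKS, "BLOCKS" }, { STATX_BTIME, "BTIME" },
};

/* Print the names of the known bits set in mask; return how many were set. */
static int print_statx_mask(unsigned int mask, FILE *out)
{
	int n = 0;

	assert(out != NULL);
	for (size_t i = 0; i < sizeof(statx_bits) / sizeof(statx_bits[0]); i++) {
		if (mask & statx_bits[i].bit) {
			fprintf(out, "%s%s", n ? " " : "", statx_bits[i].name);
			n++;
		}
	}
	fputc('\n', out);
	return n;
}
```

Calling print_statx_mask(0x7ff, stdout) lists the eleven basic fields; STATX_BTIME is not among them, which is why it has to be requested separately.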
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
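A tool that wants to act on these attribute bits might decode stx_attributes along the following lines. This is a sketch only, assuming the STATX_ATTR_* constants from the installed <linux/stat.h>; the table, the lowercase labels and the function name are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <linux/stat.h>   /* STATX_ATTR_* bits */

/* Hypothetical table pairing each attribute bit with a display label. */
static const struct {
	unsigned long long bit;
	const char *name;
} attr_names[] = {
	{ STATX_ATTR_COMPRESSED, "compressed" },
	{ STATX_ATTR_IMMUTABLE,  "immutable"  },
	{ STATX_ATTR_APPEND,     "append"     },
	{ STATX_ATTR_NODUMP,     "nodump"     },
	{ STATX_ATTR_ENCRYPTED,  "encrypted"  },
	{ STATX_ATTR_AUTOMOUNT,  "automount"  },
};

/* Write a comma-separated list of the set attributes into buf; return buf. */
static char *describe_attrs(unsigned long long attrs, char *buf, size_t len)
{
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(attr_names) / sizeof(attr_names[0]); i++) {
		if (!(attrs & attr_names[i].bit))
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
				 used ? "," : "", attr_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* out of room: stop, output may be truncated */
		used += (size_t)n;
	}
	return buf;
}
```

With the value 0x1000 shown in the example output for the NFS directory, this yields "automount".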
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 19:46:22 +03:00
struct inode *inode = d_inode(path->dentry);
2006-06-26 11:25:55 +04:00
struct task_struct *p = get_proc_task(inode);
2023-01-13 14:49:12 +03:00
generic_fillattr(&nop_mnt_idmap, inode, stat);
2006-06-26 11:25:47 +04:00
2006-06-26 11:25:55 +04:00
if (p) {
stat->nlink += get_nr_threads(p);
put_task_struct(p);
2006-06-26 11:25:47 +04:00
}
return 0;
}
2006-10-02 13:17:05 +04:00
2007-02-12 11:55:40 +03:00
static const struct inode_operations proc_task_inode_operations = {
2006-10-02 13:17:05 +04:00
.lookup = proc_task_lookup,
.getattr = proc_task_getattr,
.setattr = proc_setattr,
procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.
The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:
hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.
hidepid=1 means users may not access any /proc/<pid>/ directories, but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking is done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.
hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides process' euid
and egid. It complicates an intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.
gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting
a non-root user in the sudoers file or similar. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystroke
timings:
http://www.openwall.com/lists/oss-security/2011/11/05/3
hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on the fact that leaked
information such as the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user who started the setuid program may read
the counters.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 03:11:31 +04:00
.permission = proc_pid_permission,
2006-10-02 13:17:05 +04:00
};
2007-02-12 11:55:34 +03:00
static const struct file_operations proc_task_operations = {
2006-10-02 13:17:05 +04:00
.read = generic_read_dir,
2016-04-21 00:13:54 +03:00
.iterate_shared = proc_task_readdir,
.llseek = generic_file_llseek,
2006-10-02 13:17:05 +04:00
};
2016-12-13 03:45:32 +03:00
void __init set_proc_pid_nlink(void)
{
nlink_tid = pid_entry_nlink(tid_base_stuff, ARRAY_SIZE(tid_base_stuff));
nlink_tgid = pid_entry_nlink(tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
}