2019-05-19 15:08:55 +03:00
// SPDX-License-Identifier: GPL-2.0-only
2005-04-17 02:20:36 +04:00
/*
 * sysctl.c: General linux system control interface
 *
 * Begun 24 March 1995, Stephen Tweedie
 * Added /proc support, Dec 1995
 * Added bdflush entry and intvec min/max checking, 2/23/96, Tom Dyas.
 * Added hooks for /proc/sys/net (minor, minor patch), 96/4/1, Mike Shaver.
 * Added kernel/java-{interpreter,appletviewer}, 96/5/10, Mike Shaver.
 * Dynamic registration fixes, Stephen Tweedie.
 * Added kswapd-interval, ctrl-alt-del, printk stuff, 1/8/97, Chris Horn.
 * Made sysctl support optional via CONFIG_SYSCTL, 1/10/97, Chris
 *  Horn.
 * Added proc_doulongvec_ms_jiffies_minmax, 09/08/99, Carlos H. Bauer.
 * Added proc_doulongvec_minmax, 09/08/99, Carlos H. Bauer.
 * Changed linked lists to use list.h instead of lists.h, 02/24/00, Bill
 *  Wendling.
 * The list_for_each() macro wasn't appropriate for the sysctl loop.
 *  Removed it and replaced it with older style, 03/23/00, Bill Wendling
 */
# include <linux/module.h>
# include <linux/mm.h>
# include <linux/swap.h>
# include <linux/slab.h>
# include <linux/sysctl.h>
2012-03-29 01:42:50 +04:00
# include <linux/bitmap.h>
2010-03-11 02:23:59 +03:00
# include <linux/signal.h>
2021-07-01 04:54:59 +03:00
# include <linux/panic.h>
kptr_restrict for hiding kernel pointers from unprivileged users
Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
sysctl.
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.
If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".
[akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
[akpm@linux-foundation.org: coding-style fixup]
[randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
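The three kptr_restrict settings described in the commit message above can be sketched as a small userspace model. This is only an illustration under stated assumptions: pk_hides_pointer() and format_pk() are hypothetical names, the capability check is reduced to a boolean, and the IRQ-context check mentioned in the changelog is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical userspace model of the %pK policy; pk_hides_pointer() and
 * format_pk() are illustrative names, not kernel functions.
 */
static bool pk_hides_pointer(int kptr_restrict, bool has_cap_syslog)
{
	switch (kptr_restrict) {
	case 0:
		return false;		/* behave like plain %p */
	case 1:
		return !has_cap_syslog;	/* hide from readers without CAP_SYSLOG */
	default:
		return true;		/* 2: hide regardless of privileges */
	}
}

/* Write the pointer, or zeros of the same width, into out. */
static int format_pk(char *out, size_t outlen, const void *ptr,
		     int kptr_restrict, bool has_cap_syslog)
{
	uintptr_t shown = (uintptr_t)ptr;

	assert(out != NULL && outlen > 0);
	if (pk_hides_pointer(kptr_restrict, has_cap_syslog))
		shown = 0;
	return snprintf(out, outlen, "%0*llx", (int)(2 * sizeof(void *)),
			(unsigned long long)shown);
}
```

Printing zeros of full pointer width keeps the output parseable by userland code that expects a hexadecimal value, which is the reason the commit message gives for not printing "(null)".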
2011-01-13 03:59:41 +03:00
# include <linux/printk.h>
2005-04-17 02:20:36 +04:00
# include <linux/proc_fs.h>
V3 file capabilities: alter behavior of cap_setpcap
The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
can change the capabilities of another process, p2. This is not the
meaning that was intended for this capability at all, and this
implementation came about purely because, without filesystem capabilities,
there was no way to use capabilities without one process bestowing them on
another.
Since we now have a filesystem support for capabilities we can fix the
implementation of CAP_SETPCAP.
The most significant thing about this change is that, with it in effect, no
process can set the capabilities of another process.
The capabilities of a program are set via the capability convolution
rules:
pI(post-exec) = pI(pre-exec)
pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
pE(post-exec) = fE ? pP(post-exec) : 0
at exec() time. As such, the only influence the pre-exec() program can
have on the post-exec() program's capabilities are through the pI
capability set.
The correct implementation for CAP_SETPCAP (and that enabled by this patch)
is that it can be used to add extra pI capabilities to the current process
- to be picked up by subsequent exec()s when the above convolution rules
are applied.
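The convolution rules above can be written out as a short C sketch. It assumes a 64-bit mask per capability set; cap_set_t, struct proc_caps, struct file_caps, cap_bit() and caps_after_exec() are hypothetical names used for illustration only, not the kernel's own types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_set_t;	/* one bit per capability (model type) */

#define MODEL_CAP_NET_RAW 13	/* same number as CAP_NET_RAW in linux/capability.h */

struct proc_caps { cap_set_t inheritable, permitted, effective; };
struct file_caps { cap_set_t inheritable, permitted; bool effective; };

static cap_set_t cap_bit(int cap)
{
	assert(cap >= 0 && cap < 64);
	return (cap_set_t)1 << cap;
}

/* Apply pI, pP and pE rules from the text; bset plays the role of X. */
static struct proc_caps caps_after_exec(struct proc_caps pre,
					struct file_caps f, cap_set_t bset)
{
	struct proc_caps post;

	post.inheritable = pre.inheritable;
	post.permitted = (bset & f.permitted) |
			 (post.inheritable & f.inheritable);
	post.effective = f.effective ? post.permitted : 0;

	/* Nothing can become permitted unless it came from bset or pI. */
	assert((post.permitted & ~(bset | pre.inheritable)) == 0);
	return post;
}
```

The only input the pre-exec program controls here is pre.inheritable, which matches the statement that pI is its sole influence on the post-exec capabilities.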
Here is how it works:
Let's say we have a process, p. It has capability sets, pE, pP and pI.
Generally, p, can change the value of its own pI to pI' where
(pI' & ~pI) & ~pP = 0.
That is, the only new things in pI' that were not present in pI need to
be present in pP.
The role of CAP_SETPCAP is basically to permit changes to pI beyond
the above:
if (pE & CAP_SETPCAP) {
pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
}
This capability is useful for things like login, which (say, via
pam_cap) might want to raise certain inheritable capabilities for use
by the children of the logged-in user's shell, but those capabilities
are not useful to or needed by the login program itself.
One such use might be to limit who can run ping. You set the
capabilities of the 'ping' program to be "= cap_net_raw+i", and then
only shells that have (pI & CAP_NET_RAW) will be able to run
it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
would have to also have (pP & CAP_NET_RAW) in order to raise this
capability and pass it on through the inheritable set.
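The rule for changing pI, with and without CAP_SETPCAP, can be sketched as a single predicate. This is a minimal model, assuming 64-bit capability masks; may_set_inheritable() and cap_bit() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_set_t;	/* one bit per capability (model type) */

#define MODEL_CAP_SETPCAP 8	/* same number as CAP_SETPCAP in linux/capability.h */
#define MODEL_CAP_NET_RAW 13	/* same number as CAP_NET_RAW in linux/capability.h */

static cap_set_t cap_bit(int cap)
{
	assert(cap >= 0 && cap < 64);
	return (cap_set_t)1 << cap;
}

/* May a process with sets pI, pP and pE replace pI with new_pI? */
static bool may_set_inheritable(cap_set_t pI, cap_set_t pP, cap_set_t pE,
				cap_set_t new_pI)
{
	/* With CAP_SETPCAP effective, any new inheritable set is allowed. */
	if (pE & cap_bit(MODEL_CAP_SETPCAP))
		return true;
	/* Otherwise every newly added bit must already be permitted. */
	return ((new_pI & ~pI) & ~pP) == 0;
}
```

In the login example from the text, a process holding only CAP_SETPCAP in pE can add CAP_NET_RAW to pI without ever having CAP_NET_RAW in pP.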
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:05:59 +04:00
# include <linux/security.h>
2005-04-17 02:20:36 +04:00
# include <linux/ctype.h>
2012-07-31 01:42:48 +04:00
# include <linux/kmemleak.h>
2021-12-29 03:49:13 +03:00
# include <linux/filter.h>
2007-07-17 15:03:45 +04:00
# include <linux/fs.h>
2005-04-17 02:20:36 +04:00
# include <linux/init.h>
# include <linux/kernel.h>
2005-11-11 07:33:52 +03:00
# include <linux/kobject.h>
2005-08-16 09:18:02 +04:00
# include <linux/net.h>
2005-04-17 02:20:36 +04:00
# include <linux/sysrq.h>
# include <linux/highuid.h>
# include <linux/writeback.h>
2009-09-22 18:18:09 +04:00
# include <linux/ratelimit.h>
2010-05-25 01:32:28 +04:00
# include <linux/compaction.h>
2005-04-17 02:20:36 +04:00
# include <linux/hugetlb.h>
# include <linux/initrd.h>
2008-04-29 12:01:32 +04:00
# include <linux/key.h>
2005-04-17 02:20:36 +04:00
# include <linux/times.h>
# include <linux/limits.h>
# include <linux/dcache.h>
# include <linux/syscalls.h>
2008-07-24 08:27:03 +04:00
# include <linux/vmstat.h>
2006-02-21 05:27:58 +03:00
# include <linux/nfs_fs.h>
# include <linux/acpi.h>
2007-07-18 05:37:02 +04:00
# include <linux/reboot.h>
2008-05-12 23:20:43 +04:00
# include <linux/ftrace.h>
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out of its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 14:02:48 +04:00
# include <linux/perf_event.h>
2010-08-10 04:18:56 +04:00
# include <linux/oom.h>
2011-04-02 01:07:50 +04:00
# include <linux/kmod.h>
2011-11-01 04:11:20 +04:00
# include <linux/capability.h>
2012-02-13 07:58:52 +04:00
# include <linux/binfmts.h>
2013-02-07 19:46:59 +04:00
# include <linux/sched/sysctl.h>
2016-09-28 08:27:17 +03:00
# include <linux/mount.h>
userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Userfaultfd can be misused to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
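The access rule for the knob can be modelled in a few lines. This is a sketch only: userfaultfd_permission() is a hypothetical name, the capability test is reduced to a boolean, and the model covers just the two settings (0 and 1) that the text describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the check: returns 0 when the caller may use
 * userfaultfd and -EPERM otherwise.
 */
static int userfaultfd_permission(int unprivileged_userfaultfd,
				  bool has_cap_sys_ptrace)
{
	/* Assumption of this model: only the settings 0 and 1 exist. */
	assert(unprivileged_userfaultfd == 0 || unprivileged_userfaultfd == 1);

	if (!unprivileged_userfaultfd && !has_cap_sys_ptrace)
		return -EPERM;
	return 0;
}
```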
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 03:16:41 +03:00
# include <linux/userfaultfd_k.h>
2020-04-24 09:43:36 +03:00
# include <linux/pid.h>
2005-04-17 02:20:36 +04:00
2019-03-08 03:29:40 +03:00
# include "../lib/kstrtox.h"
2016-12-24 22:46:01 +03:00
# include <linux/uaccess.h>
2005-04-17 02:20:36 +04:00
# include <asm/processor.h>
2006-09-30 03:47:55 +04:00
# ifdef CONFIG_X86
# include <asm/nmi.h>
2006-12-07 04:14:11 +03:00
# include <asm/stacktrace.h>
2008-01-30 15:30:05 +03:00
# include <asm/io.h>
2006-09-30 03:47:55 +04:00
# endif
2012-03-28 21:30:03 +04:00
# ifdef CONFIG_SPARC
# include <asm/setup.h>
# endif
2010-03-11 02:24:09 +03:00
# ifdef CONFIG_RT_MUTEXES
# include <linux/rtmutex.h>
# endif
2010-02-13 01:19:19 +03:00
2005-04-17 02:20:36 +04:00
# if defined(CONFIG_SYSCTL)
2007-10-17 10:26:09 +04:00
/* Constants used for minimum and maximum */
2016-04-21 18:28:50 +03:00
# ifdef CONFIG_PERF_EVENTS
2022-01-22 09:11:14 +03:00
static const int six_hundred_forty_kb = 640 * 1024 ;
2016-04-21 18:28:50 +03:00
# endif
2007-10-17 10:26:09 +04:00
2009-05-01 02:08:57 +04:00
2022-01-22 09:11:09 +03:00
static const int ngroups_max = NGROUPS_MAX ;
2011-11-01 04:11:20 +04:00
static const int cap_last_cap = CAP_LAST_CAP ;
2005-04-17 02:20:36 +04:00
2006-10-20 10:28:34 +04:00
# ifdef CONFIG_PROC_SYSCTL
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
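The difference between the legacy and strict handling of string writes can be reproduced in userspace. The sketch below mirrors the write path of _proc_do_string() under stated assumptions: model_string_write() and enum writes_mode are hypothetical names, and the file position is passed in as a plain size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum writes_mode { WRITES_LEGACY = -1, WRITES_WARN = 0, WRITES_STRICT = 1 };

/*
 * Model of one write() to a string sysctl. data is the kernel-side string
 * of capacity maxlen, *pos is the file position carried between writes.
 */
static void model_string_write(char *data, size_t maxlen, size_t *pos,
			       enum writes_mode mode,
			       const char *buf, size_t count)
{
	size_t len, i;

	assert(data != NULL && maxlen > 0 && pos != NULL && buf != NULL);

	if (mode == WRITES_STRICT) {
		/* Only continue writes that start within the current string. */
		len = strlen(data);
		if (len > maxlen - 1)
			len = maxlen - 1;
		if (*pos > len)
			return;
		len = *pos;
	} else {
		/* Legacy and warn modes restart from the beginning. */
		len = 0;
	}

	*pos += count;
	for (i = 0; i < count && len < maxlen - 1; i++) {
		if (buf[i] == '\0' || buf[i] == '\n')
			break;
		data[len++] = buf[i];
	}
	data[len] = '\0';
}
```

With a long first write followed by "/bin/true", the legacy mode ends with "/bin/true" while the strict mode keeps the capped first write, because the second write starts past the end of the stored string.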
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-07 01:37:19 +04:00
2017-07-13 00:33:30 +03:00
/**
 * enum sysctl_writes_mode - supported sysctl write modes
 *
 * @SYSCTL_WRITES_LEGACY: each write syscall must fully contain the sysctl value
2019-07-17 02:26:54 +03:00
 *	to be written, and multiple writes on the same sysctl file descriptor
 *	will rewrite the sysctl value, regardless of file position. No warning
 *	is issued when the initial position is not 0.
2017-07-13 00:33:30 +03:00
 * @SYSCTL_WRITES_WARN: same as above but warn when the initial file position is
2019-07-17 02:26:54 +03:00
 *	not 0.
2017-07-13 00:33:30 +03:00
 * @SYSCTL_WRITES_STRICT: writes to numeric sysctl entries must always be at
2019-07-17 02:26:54 +03:00
 *	file position 0 and the value must be fully contained in the buffer
 *	sent to the write syscall. If dealing with strings respect the file
 *	position, but restrict this to the max length of the buffer, anything
 *	past the max length will be ignored. Multiple writes will append
 *	to the buffer.
2017-07-13 00:33:30 +03:00
 *
 * These write modes control how current file position affects the behavior of
 * updating sysctl values through the proc interface on each write.
 */
enum sysctl_writes_mode {
	SYSCTL_WRITES_LEGACY		= -1,
	SYSCTL_WRITES_WARN		= 0,
	SYSCTL_WRITES_STRICT		= 1,
};
2014-06-07 01:37:19 +04:00
2017-07-13 00:33:30 +03:00
static enum sysctl_writes_mode sysctl_writes_strict = SYSCTL_WRITES_STRICT ;
2020-04-24 09:43:37 +03:00
# endif /* CONFIG_PROC_SYSCTL */
2018-03-10 17:14:51 +03:00
2019-09-24 01:38:47 +03:00
# if defined(HAVE_ARCH_PICK_MMAP_LAYOUT) || \
defined ( CONFIG_ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT )
2005-04-17 02:20:36 +04:00
int sysctl_legacy_va_layout ;
# endif
2010-05-25 01:32:31 +04:00
# ifdef CONFIG_COMPACTION
2022-01-22 09:11:19 +03:00
/* min_extfrag_threshold is SYSCTL_ZERO */
2022-01-22 09:11:14 +03:00
static const int max_extfrag_threshold = 1000 ;
2010-05-25 01:32:31 +04:00
# endif
2020-04-24 09:43:37 +03:00
# endif /* CONFIG_SYSCTL */
/*
* / proc / sys support
*/
2006-09-27 12:51:04 +04:00
# ifdef CONFIG_PROC_SYSCTL
2005-04-17 02:20:36 +04:00
2014-06-07 01:37:17 +04:00
static int _proc_do_string(char *data, int maxlen, int write,
2020-04-24 09:43:38 +03:00
		char *buffer, size_t *lenp, loff_t *ppos)
2005-04-17 02:20:36 +04:00
{
	size_t len;
2020-04-24 09:43:38 +03:00
	char c, *p;
2007-02-10 12:46:38 +03:00
	if (!data || !maxlen || !*lenp) {
2005-04-17 02:20:36 +04:00
		*lenp = 0;
		return 0;
	}
2007-02-10 12:46:38 +03:00
2005-04-17 02:20:36 +04:00
	if (write) {
2014-06-07 01:37:19 +04:00
		if (sysctl_writes_strict == SYSCTL_WRITES_STRICT) {
			/* Only continue writes not past the end of buffer. */
			len = strlen(data);
			if (len > maxlen - 1)
				len = maxlen - 1;
			if (*ppos > len)
				return 0;
			len = *ppos;
		} else {
			/* Start writing from beginning of buffer. */
			len = 0;
		}
2014-06-07 01:37:18 +04:00
		*ppos += *lenp;
2005-04-17 02:20:36 +04:00
		p = buffer;
2014-06-07 01:37:18 +04:00
		while ((p - buffer) < *lenp && len < maxlen - 1) {
2020-04-24 09:43:38 +03:00
			c = *(p++);
2005-04-17 02:20:36 +04:00
			if (c == 0 || c == '\n')
				break;
2014-06-07 01:37:18 +04:00
			data[len++] = c;
2005-04-17 02:20:36 +04:00
		}
2014-06-07 01:37:17 +04:00
		data[len] = 0;
2005-04-17 02:20:36 +04:00
	} else {
2006-10-02 13:18:04 +04:00
		len = strlen(data);
		if (len > maxlen)
			len = maxlen;
2007-02-10 12:46:38 +03:00
		if (*ppos > len) {
			*lenp = 0;
			return 0;
		}
		data += *ppos;
		len -= *ppos;
2005-04-17 02:20:36 +04:00
		if (len > *lenp)
			len = *lenp;
		if (len)
2020-04-24 09:43:38 +03:00
			memcpy(buffer, data, len);
2005-04-17 02:20:36 +04:00
		if (len < *lenp) {
2020-04-24 09:43:38 +03:00
			buffer[len] = '\n';
2005-04-17 02:20:36 +04:00
			len++;
		}
		*lenp = len;
		*ppos += len;
	}
	return 0;
}
2014-06-07 01:37:19 +04:00
static void warn_sysctl_write(struct ctl_table *table)
{
	pr_warn_once("%s wrote to %s when file position was not 0!\n"
		"This will not be supported in the future. To silence this\n"
		"warning, set kernel.sysctl_writes_strict = -1\n",
		current->comm, table->procname);
}
2017-07-13 00:33:33 +03:00
/**
2018-08-22 08:01:06 +03:00
 * proc_first_pos_non_zero_ignore - check if first position is allowed
2017-07-13 00:33:33 +03:00
 * @ppos: file position
 * @table: the sysctl table
 *
 * Returns true if the first position is non-zero and the sysctl_writes_strict
 * mode indicates this is not allowed for numeric input types. String proc
2018-08-22 08:01:06 +03:00
 * handlers can ignore the return value.
2017-07-13 00:33:33 +03:00
 */
static bool proc_first_pos_non_zero_ignore(loff_t *ppos,
					   struct ctl_table *table)
{
	if (!*ppos)
		return false;
	switch (sysctl_writes_strict) {
	case SYSCTL_WRITES_STRICT:
		return true;
	case SYSCTL_WRITES_WARN:
		warn_sysctl_write(table);
		return false;
	default:
		return false;
	}
}
2006-10-02 13:18:04 +04:00
/**
 * proc_dostring - read a string sysctl
 * @table: the sysctl table
 * @write: %TRUE if this is a write to the sysctl file
 * @buffer: the user buffer
 * @lenp: the size of the user buffer
 * @ppos: file position
 *
 * Reads/writes a string from/to the user buffer. If the kernel
 * buffer provided is not large enough to hold the string, the
 * string is truncated. The copied string is %NULL-terminated.
 * If the string is being read by the user process, it is copied
 * and a newline '\n' is added. It is truncated if the buffer is
 * not large enough.
 *
 * Returns 0 on success.
 */
2009-09-24 02:57:19 +04:00
int proc_dostring(struct ctl_table *table, int write,
2020-04-24 09:43:38 +03:00
		  void *buffer, size_t *lenp, loff_t *ppos)
2006-10-02 13:18:04 +04:00
{
2017-07-13 00:33:33 +03:00
	if (write)
		proc_first_pos_non_zero_ignore(ppos, table);
2014-06-07 01:37:19 +04:00
2020-04-24 09:43:38 +03:00
	return _proc_do_string(table->data, table->maxlen, write, buffer, lenp,
			ppos);
2006-10-02 13:18:04 +04:00
}
2010-05-05 04:26:45 +04:00
static size_t proc_skip_spaces(char **buf)
{
	size_t ret;
	char *tmp = skip_spaces(*buf);
	ret = tmp - *buf;
	*buf = tmp;
	return ret;
}
2010-05-05 04:26:55 +04:00
static void proc_skip_char(char **buf, size_t *size, const char v)
{
	while (*size) {
		if (**buf != v)
			break;
		(*size)--;
		(*buf)++;
	}
}
2019-03-08 03:29:40 +03:00
/**
 * strtoul_lenient - parse an ASCII formatted integer from a buffer and only
 *                   fail on overflow
 *
 * @cp: kernel buffer containing the string to parse
 * @endp: pointer to store the trailing characters
 * @base: the base to use
 * @res: where the parsed integer will be stored
 *
 * In case of success 0 is returned and @res will contain the parsed integer,
 * @endp will hold any trailing characters.
 * This function will fail the parse on overflow. If there wasn't an overflow
 * the function will defer the decision what characters count as invalid to the
 * caller.
 */
static int strtoul_lenient(const char *cp, char **endp, unsigned int base,
			   unsigned long *res)
{
	unsigned long long result;
	unsigned int rv;
	cp = _parse_integer_fixup_radix(cp, &base);
	rv = _parse_integer(cp, base, &result);
	if ((rv & KSTRTOX_OVERFLOW) || (result != (unsigned long)result))
		return -ERANGE;
	cp += rv;
	if (endp)
		*endp = (char *)cp;
	*res = (unsigned long)result;
	return 0;
}
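The "fail only on overflow, leave trailing characters to the caller" contract can be sketched in userspace with the standard strtoull(). This is not the kernel implementation: parse_ulong_lenient() is a hypothetical name, and strtoull() also accepts leading whitespace and a sign, which _parse_integer() does not, so a caller should check that the first character is a digit, as proc_get_long() does.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of strtoul_lenient(): returns 0 and
 * stores the value in *res, or -ERANGE if the number does not fit in an
 * unsigned long. Trailing characters are reported through *endp and are
 * never treated as an error here.
 */
static int parse_ulong_lenient(const char *cp, char **endp,
			       unsigned int base, unsigned long *res)
{
	unsigned long long result;
	char *end;

	assert(cp != NULL && res != NULL);

	errno = 0;
	result = strtoull(cp, &end, (int)base);
	if (errno == ERANGE || result > ULONG_MAX)
		return -ERANGE;

	if (endp)
		*endp = end;
	*res = (unsigned long)result;
	return 0;
}
```

Passing base 0 lets the prefix select the radix, so "0x10," parses as 16 and leaves *endp pointing at the comma for the caller to judge.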
2010-05-05 04:26:45 +04:00
# define TMPBUFLEN 22
/**
2010-05-21 22:29:53 +04:00
* proc_get_long - reads an ASCII formatted integer from a user buffer
2010-05-05 04:26:45 +04:00
*
2010-05-21 22:29:53 +04:00
* @ buf : a kernel buffer
* @ size : size of the kernel buffer
* @ val : this is where the number will be stored
* @ neg : set to % TRUE if number is negative
* @ perm_tr : a vector which contains the allowed trailers
* @ perm_tr_len : size of the perm_tr vector
* @ tr : pointer to store the trailer character
2010-05-05 04:26:45 +04:00
*
2010-05-21 22:29:53 +04:00
* In case of success % 0 is returned and @ buf and @ size are updated with
* the amount of bytes read . If @ tr is non - NULL and a trailing
* character exists ( size is non - zero after returning from this
* function ) , @ tr is updated with the trailing character .
2010-05-05 04:26:45 +04:00
*/
static int proc_get_long ( char * * buf , size_t * size ,
unsigned long * val , bool * neg ,
const char * perm_tr , unsigned perm_tr_len , char * tr )
{
int len ;
char * p , tmp [ TMPBUFLEN ] ;
if ( ! * size )
return - EINVAL ;
len = * size ;
if ( len > TMPBUFLEN - 1 )
len = TMPBUFLEN - 1 ;
memcpy ( tmp , * buf , len ) ;
tmp [ len ] = 0 ;
p = tmp ;
if ( * p = = ' - ' & & * size > 1 ) {
* neg = true ;
p + + ;
} else
* neg = false ;
if ( ! isdigit ( * p ) )
return - EINVAL ;
2019-03-08 03:29:40 +03:00
if ( strtoul_lenient ( p , & p , 0 , val ) )
return - EINVAL ;
2010-05-05 04:26:45 +04:00
len = p - tmp ;
/* We don't know if the next char is whitespace, thus we may accept
 * invalid integers (e.g. 1234...a) or two integers instead of one
 * (e.g. 123...1). So let's not allow such large numbers. */
if ( len = = TMPBUFLEN - 1 )
return - EINVAL ;
if ( len < * size & & perm_tr_len & & ! memchr ( perm_tr , * p , perm_tr_len ) )
return - EINVAL ;
2005-04-17 02:20:36 +04:00
2010-05-05 04:26:45 +04:00
if ( tr & & ( len < * size ) )
* tr = * p ;
* buf + = len ;
* size - = len ;
return 0 ;
}
/**
2010-05-21 22:29:53 +04:00
* proc_put_long - converts an integer to a decimal ASCII formatted string
2010-05-05 04:26:45 +04:00
*
2010-05-21 22:29:53 +04:00
* @ buf : the user buffer
* @ size : the size of the user buffer
* @ val : the integer to be converted
* @ neg : sign of the number , % TRUE for negative
2010-05-05 04:26:45 +04:00
*
2020-04-24 09:43:38 +03:00
* In case of success @ buf and @ size are updated with the amount of bytes
* written .
2010-05-05 04:26:45 +04:00
*/
2020-04-24 09:43:38 +03:00
static void proc_put_long ( void * * buf , size_t * size , unsigned long val , bool neg )
2010-05-05 04:26:45 +04:00
{
int len ;
char tmp [ TMPBUFLEN ] , * p = tmp ;
sprintf ( p , " %s%lu " , neg ? " - " : " " , val ) ;
len = strlen ( tmp ) ;
if ( len > * size )
len = * size ;
2020-04-24 09:43:38 +03:00
memcpy ( * buf , tmp , len ) ;
2010-05-05 04:26:45 +04:00
* size - = len ;
* buf + = len ;
}
# undef TMPBUFLEN
2020-04-24 09:43:38 +03:00
static void proc_put_char ( void * * buf , size_t * size , char c )
2010-05-05 04:26:45 +04:00
{
if ( * size ) {
2020-04-24 09:43:38 +03:00
char * * buffer = ( char * * ) buf ;
* * buffer = c ;
( * size ) - - ;
( * buffer ) + + ;
2010-05-05 04:26:45 +04:00
* buf = * buffer ;
}
}
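The two output helpers above share one idea: format into a small scratch buffer, copy at most the remaining space, and advance the cursor. The sketch below restates that in userspace C; `put_long()` and `put_char()` are hypothetical names, and `char **` replaces the kernel's `void **` to keep the pointer arithmetic standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Append the decimal form of val to *buf, writing at most *size bytes. */
static void put_long(char **buf, size_t *size, unsigned long val, bool neg)
{
	char tmp[22];	/* sign + 20 digits + NUL covers a 64-bit value */
	size_t len;

	assert(buf && size);
	snprintf(tmp, sizeof(tmp), "%s%lu", neg ? "-" : "", val);
	len = strlen(tmp);
	if (len > *size)
		len = *size;	/* silently truncate to the space left */
	memcpy(*buf, tmp, len);
	*buf += len;
	*size -= len;
}

/* Append one character if there is still room; otherwise do nothing. */
static void put_char(char **buf, size_t *size, char c)
{
	if (*size) {
		*(*buf)++ = c;
		(*size)--;
	}
}
```

Because both helpers update the cursor and the remaining size together, a caller can chain them and check the remaining size once at the end.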
2005-04-17 02:20:36 +04:00
2021-08-03 13:59:36 +03:00
static int do_proc_dobool_conv ( bool * negp , unsigned long * lvalp ,
int * valp ,
int write , void * data )
{
if ( write ) {
* ( bool * ) valp = * lvalp ;
} else {
int val = * ( bool * ) valp ;
* lvalp = ( unsigned long ) val ;
* negp = false ;
}
return 0 ;
}
2010-05-05 04:26:45 +04:00
static int do_proc_dointvec_conv ( bool * negp , unsigned long * lvalp ,
2005-04-17 02:20:36 +04:00
int * valp ,
int write , void * data )
{
if ( write ) {
2015-04-16 22:48:07 +03:00
if ( * negp ) {
if ( * lvalp > ( unsigned long ) INT_MAX + 1 )
return - EINVAL ;
2022-07-07 02:39:52 +03:00
WRITE_ONCE ( * valp , - * lvalp ) ;
2015-04-16 22:48:07 +03:00
} else {
if ( * lvalp > ( unsigned long ) INT_MAX )
return - EINVAL ;
2022-07-07 02:39:52 +03:00
WRITE_ONCE ( * valp , * lvalp ) ;
2015-04-16 22:48:07 +03:00
}
2005-04-17 02:20:36 +04:00
} else {
2022-07-07 02:39:52 +03:00
int val = READ_ONCE ( * valp ) ;
2005-04-17 02:20:36 +04:00
if ( val < 0 ) {
2010-05-05 04:26:45 +04:00
* negp = true ;
2015-09-10 01:39:06 +03:00
* lvalp = - ( unsigned long ) val ;
2005-04-17 02:20:36 +04:00
} else {
2010-05-05 04:26:45 +04:00
* negp = false ;
2005-04-17 02:20:36 +04:00
* lvalp = ( unsigned long ) val ;
}
}
return 0 ;
}
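do_proc_dointvec_conv() moves values between an `int` and a sign flag plus unsigned magnitude, and the asymmetric limits (INT_MAX for positive input, INT_MAX + 1 for negative input) are easy to misread. The following userspace sketch shows the same range checks; `sm_to_int()` and `int_to_sm()` are hypothetical names, and the READ_ONCE/WRITE_ONCE annotations are omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Store a sign/magnitude pair into an int; returns -1 if out of range. */
static int sm_to_int(bool neg, unsigned long mag, int *out)
{
	assert(out);
	if (neg) {
		/* INT_MIN has magnitude INT_MAX + 1, so allow one more. */
		if (mag > (unsigned long)INT_MAX + 1)
			return -1;
		*out = (int)(-(long long)mag);
	} else {
		if (mag > (unsigned long)INT_MAX)
			return -1;
		*out = (int)mag;
	}
	return 0;
}

/* Split an int into a sign flag and an unsigned magnitude. */
static void int_to_sm(int val, bool *neg, unsigned long *mag)
{
	*neg = val < 0;
	/* Negating after the unsigned cast avoids overflow for INT_MIN. */
	*mag = *neg ? -(unsigned long)val : (unsigned long)val;
}
```

The cast before negation in `int_to_sm()` mirrors the `- ( unsigned long ) val` expression in the read path above, which is what keeps INT_MIN well defined.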
2017-07-13 00:33:36 +03:00
static int do_proc_douintvec_conv ( unsigned long * lvalp ,
unsigned int * valp ,
int write , void * data )
2016-08-26 01:16:51 +03:00
{
if ( write ) {
2017-04-07 18:51:07 +03:00
if ( * lvalp > UINT_MAX )
return - EINVAL ;
2022-07-07 02:39:53 +03:00
WRITE_ONCE ( * valp , * lvalp ) ;
2016-08-26 01:16:51 +03:00
} else {
2022-07-07 02:39:53 +03:00
unsigned int val = READ_ONCE ( * valp ) ;
2016-08-26 01:16:51 +03:00
* lvalp = ( unsigned long ) val ;
}
return 0 ;
}
2010-05-05 04:26:45 +04:00
static const char proc_wspace_sep [ ] = { ' ' , ' \t ' , ' \n ' } ;
2007-10-18 14:05:22 +04:00
static int __do_proc_dointvec ( void * tbl_data , struct ctl_table * table ,
2020-04-24 09:43:38 +03:00
int write , void * buffer ,
2006-10-02 13:18:23 +04:00
size_t * lenp , loff_t * ppos ,
2010-05-05 04:26:45 +04:00
int ( * conv ) ( bool * negp , unsigned long * lvalp , int * valp ,
2005-04-17 02:20:36 +04:00
int write , void * data ) ,
void * data )
{
2010-05-05 04:26:45 +04:00
int * i , vleft , first = 1 , err = 0 ;
size_t left ;
2020-04-24 09:43:38 +03:00
char * p ;
2005-04-17 02:20:36 +04:00
2010-05-05 04:26:45 +04:00
if ( ! tbl_data | | ! table - > maxlen | | ! * lenp | | ( * ppos & & ! write ) ) {
2005-04-17 02:20:36 +04:00
* lenp = 0 ;
return 0 ;
}
2006-10-02 13:18:23 +04:00
i = ( int * ) tbl_data ;
2005-04-17 02:20:36 +04:00
vleft = table - > maxlen / sizeof ( * i ) ;
left = * lenp ;
if ( ! conv )
conv = do_proc_dointvec_conv ;
2010-05-05 04:26:45 +04:00
if ( write ) {
2017-07-13 00:33:33 +03:00
if ( proc_first_pos_non_zero_ignore ( ppos , table ) )
goto out ;
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-07 01:37:19 +04:00
2010-05-05 04:26:45 +04:00
if ( left > PAGE_SIZE - 1 )
left = PAGE_SIZE - 1 ;
2020-04-24 09:43:38 +03:00
p = buffer ;
2010-05-05 04:26:45 +04:00
}
2005-04-17 02:20:36 +04:00
for ( ; left & & vleft - - ; i + + , first = 0 ) {
2010-05-05 04:26:45 +04:00
unsigned long lval ;
bool neg ;
2005-04-17 02:20:36 +04:00
2010-05-05 04:26:45 +04:00
if ( write ) {
2015-12-24 08:13:10 +03:00
left - = proc_skip_spaces ( & p ) ;
2005-04-17 02:20:36 +04:00
2010-05-26 03:10:14 +04:00
if ( ! left )
break ;
2015-12-24 08:13:10 +03:00
err = proc_get_long ( & p , & left , & lval , & neg ,
2010-05-05 04:26:45 +04:00
proc_wspace_sep ,
sizeof ( proc_wspace_sep ) , NULL ) ;
if ( err )
2005-04-17 02:20:36 +04:00
break ;
2010-05-05 04:26:45 +04:00
if ( conv ( & neg , & lval , i , 1 , data ) ) {
err = - EINVAL ;
2005-04-17 02:20:36 +04:00
break ;
2010-05-05 04:26:45 +04:00
}
2005-04-17 02:20:36 +04:00
} else {
2010-05-05 04:26:45 +04:00
if ( conv ( & neg , & lval , i , 0 , data ) ) {
err = - EINVAL ;
break ;
}
2005-04-17 02:20:36 +04:00
if ( ! first )
2020-04-24 09:43:38 +03:00
proc_put_char ( & buffer , & left , ' \t ' ) ;
proc_put_long ( & buffer , & left , lval , neg ) ;
2005-04-17 02:20:36 +04:00
}
}
2010-05-05 04:26:45 +04:00
if ( ! write & & ! first & & left & & ! err )
2020-04-24 09:43:38 +03:00
proc_put_char ( & buffer , & left , ' \n ' ) ;
2010-05-26 03:10:14 +04:00
if ( write & & ! err & & left )
2015-12-24 08:13:10 +03:00
left - = proc_skip_spaces ( & p ) ;
2020-04-24 09:43:38 +03:00
if ( write & & first )
return err ? : - EINVAL ;
2005-04-17 02:20:36 +04:00
* lenp - = left ;
2014-06-07 01:37:19 +04:00
out :
2005-04-17 02:20:36 +04:00
* ppos + = * lenp ;
2010-05-05 04:26:45 +04:00
return err ;
2005-04-17 02:20:36 +04:00
}
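The write path of __do_proc_dointvec() accepts fewer values than the table holds (it breaks out of the loop once the input is exhausted) but rejects a write that supplies no value at all. A simplified userspace sketch of that loop follows; `parse_vec()` is a hypothetical name, it handles unsigned values only, and, unlike the kernel loop, it relies on `strtoul()` rather than proc_get_long().

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse up to n whitespace-separated unsigned values from s into vec.
 * Slots without a matching input token keep their previous value.
 * Returns the number of values stored, or -1 on malformed or empty input.
 */
static int parse_vec(const char *s, unsigned long *vec, size_t n)
{
	size_t i;

	assert(s && vec);
	for (i = 0; i < n; i++) {
		char *end;
		unsigned long v;

		while (isspace((unsigned char)*s))
			s++;
		if (!*s)
			break;		/* input exhausted: fewer values is fine */
		if (!isdigit((unsigned char)*s))
			return -1;
		errno = 0;
		v = strtoul(s, &end, 10);
		if (errno == ERANGE)
			return -1;
		if (*end && !isspace((unsigned char)*end))
			return -1;	/* only whitespace may follow a number */
		vec[i] = v;
		s = end;
	}
	return i ? (int)i : -1;		/* a write with no value is an error */
}
```

Writing "3" into a two-element vector that holds 1 and 2 therefore leaves it holding 3 and 2, with the second slot untouched.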
2009-09-24 02:57:19 +04:00
static int do_proc_dointvec ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos ,
2010-05-05 04:26:45 +04:00
int ( * conv ) ( bool * negp , unsigned long * lvalp , int * valp ,
2006-10-02 13:18:23 +04:00
int write , void * data ) ,
void * data )
{
2009-09-24 02:57:19 +04:00
return __do_proc_dointvec ( table - > data , table , write ,
2006-10-02 13:18:23 +04:00
buffer , lenp , ppos , conv , data ) ;
}
2017-07-13 00:33:36 +03:00
static int do_proc_douintvec_w ( unsigned int * tbl_data ,
struct ctl_table * table ,
2020-04-24 09:43:38 +03:00
void * buffer ,
2017-07-13 00:33:36 +03:00
size_t * lenp , loff_t * ppos ,
int ( * conv ) ( unsigned long * lvalp ,
unsigned int * valp ,
int write , void * data ) ,
void * data )
{
unsigned long lval ;
int err = 0 ;
size_t left ;
bool neg ;
2020-04-24 09:43:38 +03:00
char * p = buffer ;
2017-07-13 00:33:36 +03:00
left = * lenp ;
if ( proc_first_pos_non_zero_ignore ( ppos , table ) )
goto bail_early ;
if ( left > PAGE_SIZE - 1 )
left = PAGE_SIZE - 1 ;
left - = proc_skip_spaces ( & p ) ;
if ( ! left ) {
err = - EINVAL ;
goto out_free ;
}
err = proc_get_long ( & p , & left , & lval , & neg ,
proc_wspace_sep ,
sizeof ( proc_wspace_sep ) , NULL ) ;
if ( err | | neg ) {
err = - EINVAL ;
goto out_free ;
}
if ( conv ( & lval , tbl_data , 1 , data ) ) {
err = - EINVAL ;
goto out_free ;
}
if ( ! err & & left )
left - = proc_skip_spaces ( & p ) ;
out_free :
if ( err )
return - EINVAL ;
return 0 ;
/* This is in keeping with old __do_proc_dointvec() */
bail_early :
* ppos + = * lenp ;
return err ;
}
2020-04-24 09:43:38 +03:00
static int do_proc_douintvec_r ( unsigned int * tbl_data , void * buffer ,
2017-07-13 00:33:36 +03:00
size_t * lenp , loff_t * ppos ,
int ( * conv ) ( unsigned long * lvalp ,
unsigned int * valp ,
int write , void * data ) ,
void * data )
{
unsigned long lval ;
int err = 0 ;
size_t left ;
left = * lenp ;
if ( conv ( & lval , tbl_data , 0 , data ) ) {
err = - EINVAL ;
goto out ;
}
2020-04-24 09:43:38 +03:00
proc_put_long ( & buffer , & left , lval , false ) ;
if ( ! left )
2017-07-13 00:33:36 +03:00
goto out ;
2020-04-24 09:43:38 +03:00
proc_put_char ( & buffer , & left , ' \n ' ) ;
2017-07-13 00:33:36 +03:00
out :
* lenp - = left ;
* ppos + = * lenp ;
return err ;
}
static int __do_proc_douintvec ( void * tbl_data , struct ctl_table * table ,
2020-04-24 09:43:38 +03:00
int write , void * buffer ,
2017-07-13 00:33:36 +03:00
size_t * lenp , loff_t * ppos ,
int ( * conv ) ( unsigned long * lvalp ,
unsigned int * valp ,
int write , void * data ) ,
void * data )
{
unsigned int * i , vleft ;
if ( ! tbl_data | | ! table - > maxlen | | ! * lenp | | ( * ppos & & ! write ) ) {
* lenp = 0 ;
return 0 ;
}
i = ( unsigned int * ) tbl_data ;
vleft = table - > maxlen / sizeof ( * i ) ;
/*
* Arrays are not supported , keep this simple . * Do not * add
* support for them .
*/
if ( vleft ! = 1 ) {
* lenp = 0 ;
return - EINVAL ;
}
if ( ! conv )
conv = do_proc_douintvec_conv ;
if ( write )
return do_proc_douintvec_w ( i , table , buffer , lenp , ppos ,
conv , data ) ;
return do_proc_douintvec_r ( i , buffer , lenp , ppos , conv , data ) ;
}
2022-01-22 09:13:20 +03:00
int do_proc_douintvec ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp , loff_t * ppos ,
int ( * conv ) ( unsigned long * lvalp ,
unsigned int * valp ,
int write , void * data ) ,
void * data )
2017-07-13 00:33:36 +03:00
{
return __do_proc_douintvec ( table - > data , table , write ,
buffer , lenp , ppos , conv , data ) ;
}
2021-08-03 13:59:36 +03:00
/**
* proc_dobool - read / write a bool
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned int ) integer
* values from / to the user buffer , treated as an ASCII string .
*
* Returns 0 on success .
*/
int proc_dobool ( struct ctl_table * table , int write , void * buffer ,
size_t * lenp , loff_t * ppos )
{
return do_proc_dointvec ( table , write , buffer , lenp , ppos ,
do_proc_dobool_conv , NULL ) ;
}
2005-04-17 02:20:36 +04:00
/**
* proc_dointvec - read a vector of integers
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned int ) integer
* values from / to the user buffer , treated as an ASCII string .
*
* Returns 0 on success .
*/
2020-04-24 09:43:38 +03:00
int proc_dointvec ( struct ctl_table * table , int write , void * buffer ,
size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2016-08-26 01:16:51 +03:00
return do_proc_dointvec ( table , write , buffer , lenp , ppos , NULL , NULL ) ;
}
2020-04-02 07:10:42 +03:00
# ifdef CONFIG_COMPACTION
static int proc_dointvec_minmax_warn_RT_change ( struct ctl_table * table ,
2020-04-24 09:43:38 +03:00
int write , void * buffer , size_t * lenp , loff_t * ppos )
2020-04-02 07:10:42 +03:00
{
int ret , old ;
if ( ! IS_ENABLED ( CONFIG_PREEMPT_RT ) | | ! write )
return proc_dointvec_minmax ( table , write , buffer , lenp , ppos ) ;
old = * ( int * ) table - > data ;
ret = proc_dointvec_minmax ( table , write , buffer , lenp , ppos ) ;
if ( ret )
return ret ;
if ( old ! = * ( int * ) table - > data )
pr_warn_once ( " sysctl attribute %s changed by %s[%d] \n " ,
table - > procname , current - > comm ,
task_pid_nr ( current ) ) ;
return ret ;
}
# endif
2016-08-26 01:16:51 +03:00
/**
* proc_douintvec - read a vector of unsigned integers
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes a single unsigned integer value from / to the user buffer ,
* treated as an ASCII string . Arrays are not supported : table - > maxlen
* must equal sizeof ( unsigned int ) , otherwise - EINVAL is returned .
*
* Returns 0 on success .
*/
2020-04-24 09:43:38 +03:00
int proc_douintvec ( struct ctl_table * table , int write , void * buffer ,
size_t * lenp , loff_t * ppos )
2016-08-26 01:16:51 +03:00
{
2017-07-13 00:33:36 +03:00
return do_proc_douintvec ( table , write , buffer , lenp , ppos ,
do_proc_douintvec_conv , NULL ) ;
2005-04-17 02:20:36 +04:00
}
2007-02-10 12:45:24 +03:00
/*
2008-10-16 09:01:41 +04:00
* Taint values can only be increased
* This means we can safely use a temporary .
2007-02-10 12:45:24 +03:00
*/
2009-09-24 02:57:19 +04:00
static int proc_taint ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2007-02-10 12:45:24 +03:00
{
2008-10-16 09:01:41 +04:00
struct ctl_table t ;
unsigned long tmptaint = get_taint ( ) ;
int err ;
2007-02-10 12:45:24 +03:00
2007-04-24 01:41:14 +04:00
if ( write & & ! capable ( CAP_SYS_ADMIN ) )
2007-02-10 12:45:24 +03:00
return - EPERM ;
2008-10-16 09:01:41 +04:00
t = * table ;
t . data = & tmptaint ;
2009-09-24 02:57:19 +04:00
err = proc_doulongvec_minmax ( & t , write , buffer , lenp , ppos ) ;
2008-10-16 09:01:41 +04:00
if ( err < 0 )
return err ;
if ( write ) {
2020-06-08 07:40:17 +03:00
int i ;
/*
* If we are relying on panic_on_taint not producing
* false positives due to userspace input , bail out
* before setting the requested taint flags .
*/
if ( panic_on_taint_nousertaint & & ( tmptaint & panic_on_taint ) )
return - EINVAL ;
2008-10-16 09:01:41 +04:00
/*
* Poor man ' s atomic or . Not worth adding a primitive
* to everyone ' s atomic . h for this
*/
2020-06-08 07:40:51 +03:00
for ( i = 0 ; i < TAINT_FLAGS_COUNT ; i + + )
if ( ( 1UL < < i ) & tmptaint )
2013-01-21 10:47:39 +04:00
add_taint ( i , LOCKDEP_STILL_OK ) ;
2008-10-16 09:01:41 +04:00
}
return err ;
2007-02-10 12:45:24 +03:00
}
2018-04-11 02:35:38 +03:00
/**
* struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax ( ) range checking structure
* @ min : pointer to minimum allowable value
* @ max : pointer to maximum allowable value
*
* The do_proc_dointvec_minmax_conv_param structure provides the
* minimum and maximum values for doing range checking for those sysctl
* parameters that use the proc_dointvec_minmax ( ) handler .
*/
2005-04-17 02:20:36 +04:00
struct do_proc_dointvec_minmax_conv_param {
int * min ;
int * max ;
} ;
2010-05-05 04:26:45 +04:00
static int do_proc_dointvec_minmax_conv ( bool * negp , unsigned long * lvalp ,
int * valp ,
2005-04-17 02:20:36 +04:00
int write , void * data )
{
2019-03-12 09:28:06 +03:00
int tmp , ret ;
2005-04-17 02:20:36 +04:00
struct do_proc_dointvec_minmax_conv_param * param = data ;
2019-03-12 09:28:06 +03:00
/*
* If writing , first do so via a temporary local int so we can
* bounds - check it before touching * valp .
*/
int * ip = write ? & tmp : valp ;
ret = do_proc_dointvec_conv ( negp , lvalp , ip , write , data ) ;
if ( ret )
return ret ;
2005-04-17 02:20:36 +04:00
if ( write ) {
2019-03-12 09:28:06 +03:00
if ( ( param - > min & & * param - > min > tmp ) | |
( param - > max & & * param - > max < tmp ) )
2005-04-17 02:20:36 +04:00
return - EINVAL ;
2022-07-07 02:39:54 +03:00
WRITE_ONCE ( * valp , tmp ) ;
2005-04-17 02:20:36 +04:00
}
2019-03-12 09:28:06 +03:00
2005-04-17 02:20:36 +04:00
return 0 ;
}
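do_proc_dointvec_minmax_conv() stages a written value in a local temporary so that a failed range check never disturbs the live variable. The pattern is sketched below in plain C; `store_bounded()` is a hypothetical name, and either bound may be NULL to mean "unbounded", matching the optional `min` and `max` pointers of the parameter structure.

```c
#include <assert.h>
#include <stddef.h>

/* Store candidate into *dst only if it lies within the optional bounds. */
static int store_bounded(int *dst, int candidate, const int *min, const int *max)
{
	int tmp = candidate;	/* stage the value; *dst is untouched so far */

	assert(dst);
	if ((min && *min > tmp) || (max && *max < tmp))
		return -1;	/* rejected: *dst keeps its old value */
	*dst = tmp;		/* commit only after the check has passed */
	return 0;
}
```

Checking before the store matters because a concurrent reader could otherwise observe an out-of-range value between the write and its rejection.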
/**
* proc_dointvec_minmax - read a vector of integers with min / max values
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned int ) integer
* values from / to the user buffer , treated as an ASCII string .
*
* This routine will ensure the values are within the range specified by
* table - > extra1 ( min ) and table - > extra2 ( max ) .
*
2018-04-11 02:35:38 +03:00
* Returns 0 on success or - EINVAL on write when the range check fails .
2005-04-17 02:20:36 +04:00
*/
2009-09-24 02:57:19 +04:00
int proc_dointvec_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
struct do_proc_dointvec_minmax_conv_param param = {
. min = ( int * ) table - > extra1 ,
. max = ( int * ) table - > extra2 ,
} ;
2009-09-24 02:57:19 +04:00
return do_proc_dointvec ( table , write , buffer , lenp , ppos ,
2005-04-17 02:20:36 +04:00
do_proc_dointvec_minmax_conv , & param ) ;
}
2018-04-11 02:35:38 +03:00
/**
* struct do_proc_douintvec_minmax_conv_param - proc_douintvec_minmax ( ) range checking structure
* @ min : pointer to minimum allowable value
* @ max : pointer to maximum allowable value
*
* The do_proc_douintvec_minmax_conv_param structure provides the
* minimum and maximum values for doing range checking for those sysctl
* parameters that use the proc_douintvec_minmax ( ) handler .
*/
2017-07-13 00:33:40 +03:00
struct do_proc_douintvec_minmax_conv_param {
unsigned int * min ;
unsigned int * max ;
} ;
static int do_proc_douintvec_minmax_conv ( unsigned long * lvalp ,
unsigned int * valp ,
int write , void * data )
{
2019-03-12 09:28:06 +03:00
int ret ;
unsigned int tmp ;
2017-07-13 00:33:40 +03:00
struct do_proc_douintvec_minmax_conv_param * param = data ;
2019-03-12 09:28:06 +03:00
/* write via temporary local uint for bounds-checking */
unsigned int * up = write ? & tmp : valp ;
2017-07-13 00:33:40 +03:00
2019-03-12 09:28:06 +03:00
ret = do_proc_douintvec_conv ( lvalp , up , write , data ) ;
if ( ret )
return ret ;
2017-11-18 02:29:28 +03:00
2019-03-12 09:28:06 +03:00
if ( write ) {
if ( ( param - > min & & * param - > min > tmp ) | |
( param - > max & & * param - > max < tmp ) )
2017-07-13 00:33:40 +03:00
return - ERANGE ;
2022-07-07 02:39:55 +03:00
WRITE_ONCE ( * valp , tmp ) ;
2017-07-13 00:33:40 +03:00
}
return 0 ;
}
/**
* proc_douintvec_minmax - read a vector of unsigned ints with min / max values
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes a single unsigned integer value from / to the user buffer ,
* treated as an ASCII string . Arrays are not supported . Negative
* strings are not allowed .
*
* This routine will ensure the values are within the range specified by
* table - > extra1 ( min ) and table - > extra2 ( max ) . There is a final sanity
* check for UINT_MAX to avoid having to support wrap around uses from
* userspace .
*
2018-04-11 02:35:38 +03:00
* Returns 0 on success or - ERANGE on write when the range check fails .
2017-07-13 00:33:40 +03:00
*/
int proc_douintvec_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2017-07-13 00:33:40 +03:00
{
struct do_proc_douintvec_minmax_conv_param param = {
. min = ( unsigned int * ) table - > extra1 ,
. max = ( unsigned int * ) table - > extra2 ,
} ;
return do_proc_douintvec ( table , write , buffer , lenp , ppos ,
do_proc_douintvec_minmax_conv , & param ) ;
}
2021-03-25 21:08:13 +03:00
/**
* proc_dou8vec_minmax - read a vector of unsigned chars with min / max values
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes up to table - > maxlen / sizeof ( u8 ) unsigned char
* values from / to the user buffer , treated as an ASCII string . Negative
* strings are not allowed .
*
* This routine will ensure the values are within the range specified by
* table - > extra1 ( min ) and table - > extra2 ( max ) .
*
* Returns 0 on success or an error on write when the range check fails .
*/
int proc_dou8vec_minmax ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp , loff_t * ppos )
{
struct ctl_table tmp ;
unsigned int min = 0 , max = 255U , val ;
u8 * data = table - > data ;
struct do_proc_douintvec_minmax_conv_param param = {
. min = & min ,
. max = & max ,
} ;
int res ;
/* Do not support arrays yet. */
if ( table - > maxlen ! = sizeof ( u8 ) )
return - EINVAL ;
if ( table - > extra1 ) {
min = * ( unsigned int * ) table - > extra1 ;
if ( min > 255U )
return - EINVAL ;
}
if ( table - > extra2 ) {
max = * ( unsigned int * ) table - > extra2 ;
if ( max > 255U )
return - EINVAL ;
}
tmp = * table ;
tmp . maxlen = sizeof ( val ) ;
tmp . data = & val ;
2022-07-12 03:15:19 +03:00
val = READ_ONCE ( * data ) ;
2021-03-25 21:08:13 +03:00
res = do_proc_douintvec ( & tmp , write , buffer , lenp , ppos ,
do_proc_douintvec_minmax_conv , & param ) ;
if ( res )
return res ;
if ( write )
2022-07-12 03:15:19 +03:00
WRITE_ONCE ( * data , val ) ;
2021-03-25 21:08:13 +03:00
return 0 ;
}
EXPORT_SYMBOL_GPL ( proc_dou8vec_minmax ) ;
2020-03-02 20:51:34 +03:00
# ifdef CONFIG_MAGIC_SYSRQ
static int sysrq_sysctl_handler ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2020-03-02 20:51:34 +03:00
{
int tmp , ret ;
tmp = sysrq_mask ( ) ;
ret = __do_proc_dointvec ( & tmp , table , write , buffer ,
lenp , ppos , NULL , NULL ) ;
if ( ret | | ! write )
return ret ;
if ( write )
sysrq_toggle_support ( tmp ) ;
return 0 ;
}
# endif
2020-04-24 09:43:38 +03:00
static int __do_proc_doulongvec_minmax ( void * data , struct ctl_table * table ,
int write , void * buffer , size_t * lenp , loff_t * ppos ,
unsigned long convmul , unsigned long convdiv )
2005-04-17 02:20:36 +04:00
{
2010-05-05 04:26:45 +04:00
unsigned long * i , * min , * max ;
int vleft , first = 1 , err = 0 ;
size_t left ;
2020-04-24 09:43:38 +03:00
char * p ;
2010-05-05 04:26:45 +04:00
if ( ! data | | ! table - > maxlen | | ! * lenp | | ( * ppos & & ! write ) ) {
2005-04-17 02:20:36 +04:00
* lenp = 0 ;
return 0 ;
}
2010-05-05 04:26:45 +04:00
2006-10-02 13:18:23 +04:00
i = ( unsigned long * ) data ;
2005-04-17 02:20:36 +04:00
min = ( unsigned long * ) table - > extra1 ;
max = ( unsigned long * ) table - > extra2 ;
vleft = table - > maxlen / sizeof ( unsigned long ) ;
left = * lenp ;
2010-05-05 04:26:45 +04:00
if ( write ) {
2017-07-13 00:33:33 +03:00
if ( proc_first_pos_non_zero_ignore ( ppos , table ) )
goto out ;
2014-06-07 01:37:19 +04:00
2010-05-05 04:26:45 +04:00
if ( left > PAGE_SIZE - 1 )
left = PAGE_SIZE - 1 ;
2020-04-24 09:43:38 +03:00
p = buffer ;
2010-05-05 04:26:45 +04:00
}
2010-10-07 23:59:29 +04:00
for ( ; left & & vleft - - ; i + + , first = 0 ) {
2010-05-05 04:26:45 +04:00
unsigned long val ;
2005-04-17 02:20:36 +04:00
if ( write ) {
2010-05-05 04:26:45 +04:00
bool neg ;
2015-12-24 08:13:10 +03:00
left - = proc_skip_spaces ( & p ) ;
proc/sysctl: fix return error for proc_doulongvec_minmax()
If the number of input parameters is less than the total parameters, an
EINVAL error will be returned.
For example, we use proc_doulongvec_minmax to pass up to two parameters
with kern_table:
{
.procname = "monitor_signals",
.data = &monitor_sigs,
.maxlen = 2*sizeof(unsigned long),
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
Reproduce:
When passing two parameters, it works normally. But when passing only one
parameter, the error "Invalid argument" (EINVAL) is returned.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
-bash: echo: write error: Invalid argument
[root@cl150 ~]# echo $?
1
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
The following is the result after applying this patch. No error is
returned when the number of input parameters is less than the total
parameters.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# echo $?
0
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
There are three processing functions dealing with digital parameters,
__do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.
This patch deals with __do_proc_doulongvec_minmax, just as
__do_proc_dointvec does, adding a check for parameters 'left'. In
__do_proc_douintvec, its code implementation explicitly does not support
multiple inputs.
static int __do_proc_douintvec(...){
...
/*
* Arrays are not supported, keep this simple. *Do not* add
* support for them.
*/
if (vleft != 1) {
*lenp = 0;
return -EINVAL;
}
...
}
So, just __do_proc_doulongvec_minmax has the problem. And most use of
proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one
parameter.
Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 02:26:13 +03:00
if ( ! left )
break ;
2010-05-05 04:26:45 +04:00
2015-12-24 08:13:10 +03:00
err = proc_get_long ( & p , & left , & val , & neg ,
2010-05-05 04:26:45 +04:00
proc_wspace_sep ,
sizeof ( proc_wspace_sep ) , NULL ) ;
2022-01-22 09:13:48 +03:00
if ( err | | neg ) {
err = - EINVAL ;
2005-04-17 02:20:36 +04:00
break ;
2022-01-22 09:13:48 +03:00
}
2017-01-26 05:20:55 +03:00
val = convmul * val / convdiv ;
2019-05-15 01:44:55 +03:00
if ( ( min & & val < * min ) | | ( max & & val > * max ) ) {
err = - EINVAL ;
break ;
}
2022-07-07 02:39:56 +03:00
WRITE_ONCE ( * i , val ) ;
2005-04-17 02:20:36 +04:00
} else {
2022-07-07 02:39:56 +03:00
val = convdiv * READ_ONCE ( * i ) / convmul ;
2020-04-24 09:43:38 +03:00
if ( ! first )
proc_put_char ( & buffer , & left , ' \t ' ) ;
proc_put_long ( & buffer , & left , val , false ) ;
2005-04-17 02:20:36 +04:00
}
}
2010-05-05 04:26:45 +04:00
if ( ! write & & ! first & & left & & ! err )
2020-04-24 09:43:38 +03:00
proc_put_char ( & buffer , & left , ' \n ' ) ;
2010-05-05 04:26:45 +04:00
if ( write & & ! err )
2015-12-24 08:13:10 +03:00
left - = proc_skip_spaces ( & p ) ;
2020-04-24 09:43:38 +03:00
if ( write & & first )
return err ? : - EINVAL ;
2005-04-17 02:20:36 +04:00
* lenp - = left ;
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior differs enough from regular POSIX files that audits
of software that interacts with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
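The difference between the legacy and the strict behavior can be modelled in
userspace. This is only a sketch under stated assumptions:
model_string_write() and its parameters are hypothetical names, the warning
issued in the default mode is not modelled, and a strict write that starts
past the current end of the string is simply ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of one write() to a string sysctl; not kernel code.
 * data/maxlen describe the backing string, buf/len the bytes written and
 * *ppos the file position. With strict set, the write lands at *ppos;
 * otherwise every write starts again at offset 0.
 */
void model_string_write(char *data, size_t maxlen, const char *buf,
			size_t len, size_t *ppos, int strict)
{
	size_t off = 0;
	size_t i;

	assert(maxlen > 0);

	if (strict) {
		/* Only continue writes that are not past the end. */
		if (*ppos > strlen(data))
			return;
		off = *ppos;
	}
	*ppos += len;

	/* Copy until the input ends, a terminator appears, or data is full. */
	for (i = 0; i < len && off < maxlen - 1; i++) {
		if (buf[i] == '\0' || buf[i] == '\n')
			break;
		data[off++] = buf[i];
	}
	data[off] = '\0';
}
```

In this model two short strict writes, "/bin" followed by "/true", leave
"/bin/true" in the string, while the same two legacy writes leave only
"/true".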
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-07 01:37:19 +04:00
out :
2005-04-17 02:20:36 +04:00
* ppos + = * lenp ;
2010-05-05 04:26:45 +04:00
return err ;
2005-04-17 02:20:36 +04:00
}
2007-10-18 14:05:22 +04:00
static int do_proc_doulongvec_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos , unsigned long convmul ,
unsigned long convdiv )
2006-10-02 13:18:23 +04:00
{
return __do_proc_doulongvec_minmax ( table - > data , table , write ,
2009-09-24 02:57:19 +04:00
buffer , lenp , ppos , convmul , convdiv ) ;
2006-10-02 13:18:23 +04:00
}
2005-04-17 02:20:36 +04:00
/**
* proc_doulongvec_minmax - read a vector of unsigned long integers with min / max values
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned long ) unsigned long
* values from / to the user buffer , treated as an ASCII string .
*
* This routine will ensure the values are within the range specified by
* table - > extra1 ( min ) and table - > extra2 ( max ) .
*
* Returns 0 on success .
*/
2009-09-24 02:57:19 +04:00
int proc_doulongvec_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2009-09-24 02:57:19 +04:00
return do_proc_doulongvec_minmax ( table , write , buffer , lenp , ppos , 1l , 1l ) ;
2005-04-17 02:20:36 +04:00
}
/**
* proc_doulongvec_ms_jiffies_minmax - read a vector of millisecond values with min / max values
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned long ) unsigned long
* values from / to the user buffer , treated as an ASCII string . The values
* are treated as milliseconds , and converted to jiffies when they are stored .
*
* This routine will ensure the values are within the range specified by
* table - > extra1 ( min ) and table - > extra2 ( max ) .
*
* Returns 0 on success .
*/
2007-10-18 14:05:22 +04:00
int proc_doulongvec_ms_jiffies_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2009-09-24 02:57:19 +04:00
return do_proc_doulongvec_minmax ( table , write , buffer ,
2005-04-17 02:20:36 +04:00
lenp , ppos , HZ , 1000l ) ;
}
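The scaling requested by proc_doulongvec_ms_jiffies_minmax (convmul = HZ,
convdiv = 1000) can be illustrated in isolation. The sketch below is a
userspace model, not kernel code: MODEL_HZ is an assumed tick rate chosen
for the example, and both helper names are hypothetical. It mirrors the
"convmul * val / convdiv" write path and the "convdiv * val / convmul" read
path shown above.

```c
#include <assert.h>
#include <limits.h>

/* Assumed tick rate for illustration; the kernel's HZ is set at build time. */
#define MODEL_HZ 250UL

/* Write path: milliseconds from userspace become stored jiffies. */
unsigned long model_ms_to_stored(unsigned long ms)
{
	assert(ms <= ULONG_MAX / MODEL_HZ);	/* keep the multiply in range */
	return MODEL_HZ * ms / 1000UL;
}

/* Read path: stored jiffies are reported back as milliseconds. */
unsigned long model_stored_to_ms(unsigned long jif)
{
	assert(jif <= ULONG_MAX / 1000UL);	/* keep the multiply in range */
	return 1000UL * jif / MODEL_HZ;
}
```

Because both directions use integer division, a value does not always
survive a round trip: with the assumed rate of 250, writing 10 stores 2 and
reads back as 8.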
2010-05-05 04:26:45 +04:00
static int do_proc_dointvec_jiffies_conv ( bool * negp , unsigned long * lvalp ,
2005-04-17 02:20:36 +04:00
int * valp ,
int write , void * data )
{
if ( write ) {
2017-05-09 01:54:58 +03:00
if ( * lvalp > INT_MAX / HZ )
2006-03-24 14:15:50 +03:00
return 1 ;
2022-07-07 02:39:57 +03:00
if ( * negp )
WRITE_ONCE ( * valp , - * lvalp * HZ ) ;
else
WRITE_ONCE ( * valp , * lvalp * HZ ) ;
2005-04-17 02:20:36 +04:00
} else {
2022-07-07 02:39:57 +03:00
int val = READ_ONCE ( * valp ) ;
2005-04-17 02:20:36 +04:00
unsigned long lval ;
if ( val < 0 ) {
2010-05-05 04:26:45 +04:00
* negp = true ;
2015-09-10 01:39:06 +03:00
lval = - ( unsigned long ) val ;
2005-04-17 02:20:36 +04:00
} else {
2010-05-05 04:26:45 +04:00
* negp = false ;
2005-04-17 02:20:36 +04:00
lval = ( unsigned long ) val ;
}
* lvalp = lval / HZ ;
}
return 0 ;
}
2010-05-05 04:26:45 +04:00
static int do_proc_dointvec_userhz_jiffies_conv ( bool * negp , unsigned long * lvalp ,
2005-04-17 02:20:36 +04:00
int * valp ,
int write , void * data )
{
if ( write ) {
2006-03-24 14:15:50 +03:00
if ( USER_HZ < HZ & & * lvalp > ( LONG_MAX / HZ ) * USER_HZ )
return 1 ;
2005-04-17 02:20:36 +04:00
* valp = clock_t_to_jiffies ( * negp ? - * lvalp : * lvalp ) ;
} else {
int val = * valp ;
unsigned long lval ;
if ( val < 0 ) {
2010-05-05 04:26:45 +04:00
* negp = true ;
2015-09-10 01:39:06 +03:00
lval = - ( unsigned long ) val ;
2005-04-17 02:20:36 +04:00
} else {
2010-05-05 04:26:45 +04:00
* negp = false ;
2005-04-17 02:20:36 +04:00
lval = ( unsigned long ) val ;
}
* lvalp = jiffies_to_clock_t ( lval ) ;
}
return 0 ;
}
2010-05-05 04:26:45 +04:00
static int do_proc_dointvec_ms_jiffies_conv ( bool * negp , unsigned long * lvalp ,
2005-04-17 02:20:36 +04:00
int * valp ,
int write , void * data )
{
if ( write ) {
2013-07-24 12:39:07 +04:00
unsigned long jif = msecs_to_jiffies ( * negp ? - * lvalp : * lvalp ) ;
if ( jif > INT_MAX )
return 1 ;
2022-07-12 03:15:20 +03:00
WRITE_ONCE ( * valp , ( int ) jif ) ;
2005-04-17 02:20:36 +04:00
} else {
2022-07-12 03:15:20 +03:00
int val = READ_ONCE ( * valp ) ;
2005-04-17 02:20:36 +04:00
unsigned long lval ;
if ( val < 0 ) {
2010-05-05 04:26:45 +04:00
* negp = true ;
2015-09-10 01:39:06 +03:00
lval = - ( unsigned long ) val ;
2005-04-17 02:20:36 +04:00
} else {
2010-05-05 04:26:45 +04:00
* negp = false ;
2005-04-17 02:20:36 +04:00
lval = ( unsigned long ) val ;
}
* lvalp = jiffies_to_msecs ( lval ) ;
}
return 0 ;
}
/**
* proc_dointvec_jiffies - read a vector of integers as seconds
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned int ) integer
* values from / to the user buffer , treated as an ASCII string .
* The values read are assumed to be in seconds , and are converted into
* jiffies .
*
* Returns 0 on success .
*/
2009-09-24 02:57:19 +04:00
int proc_dointvec_jiffies ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2009-09-24 02:57:19 +04:00
return do_proc_dointvec ( table , write , buffer , lenp , ppos ,
2005-04-17 02:20:36 +04:00
do_proc_dointvec_jiffies_conv , NULL ) ;
}
/**
* proc_dointvec_userhz_jiffies - read a vector of integers as 1 / USER_HZ seconds
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
2005-11-07 12:01:06 +03:00
* @ ppos : pointer to the file position
2005-04-17 02:20:36 +04:00
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned int ) integer
* values from / to the user buffer , treated as an ASCII string .
* The values read are assumed to be in 1 / USER_HZ seconds , and
* are converted into jiffies .
*
* Returns 0 on success .
*/
2009-09-24 02:57:19 +04:00
int proc_dointvec_userhz_jiffies ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2009-09-24 02:57:19 +04:00
return do_proc_dointvec ( table , write , buffer , lenp , ppos ,
2005-04-17 02:20:36 +04:00
do_proc_dointvec_userhz_jiffies_conv , NULL ) ;
}
/**
* proc_dointvec_ms_jiffies - read a vector of integers as milliseconds
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
2005-05-01 19:59:26 +04:00
* @ ppos : the current position in the file
2005-04-17 02:20:36 +04:00
*
* Reads / writes up to table - > maxlen / sizeof ( unsigned int ) integer
2022-07-12 03:15:20 +03:00
* values from / to the user buffer , treated as an ASCII string .
* The values read are assumed to be in 1 / 1000 seconds , and
2005-04-17 02:20:36 +04:00
* are converted into jiffies .
*
* Returns 0 on success .
*/
2020-04-24 09:43:38 +03:00
int proc_dointvec_ms_jiffies ( struct ctl_table * table , int write , void * buffer ,
size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2009-09-24 02:57:19 +04:00
return do_proc_dointvec ( table , write , buffer , lenp , ppos ,
2005-04-17 02:20:36 +04:00
do_proc_dointvec_ms_jiffies_conv , NULL ) ;
}
2020-04-24 09:43:38 +03:00
static int proc_do_cad_pid ( struct ctl_table * table , int write , void * buffer ,
size_t * lenp , loff_t * ppos )
2006-10-02 13:19:00 +04:00
{
struct pid * new_pid ;
pid_t tmp ;
int r ;
2008-02-08 15:19:20 +03:00
tmp = pid_vnr ( cad_pid ) ;
2006-10-02 13:19:00 +04:00
2009-09-24 02:57:19 +04:00
r = __do_proc_dointvec ( & tmp , table , write , buffer ,
2006-10-02 13:19:00 +04:00
lenp , ppos , NULL , NULL ) ;
if ( r | | ! write )
return r ;
new_pid = find_get_pid ( tmp ) ;
if ( ! new_pid )
return - ESRCH ;
put_pid ( xchg ( & cad_pid , new_pid ) ) ;
return 0 ;
}
2010-05-05 04:26:55 +04:00
/**
* proc_do_large_bitmap - read / write from / to a large bitmap
* @ table : the sysctl table
* @ write : % TRUE if this is a write to the sysctl file
* @ buffer : the user buffer
* @ lenp : the size of the user buffer
* @ ppos : file position
*
* The bitmap is stored at table - > data and the bitmap length ( in bits )
* in table - > maxlen .
*
* We use a range comma separated format ( e . g . 1 , 3 - 4 , 10 - 10 ) so that
* large bitmaps may be represented in a compact manner . Writing into
* the file will clear the bitmap then update it with the given input .
*
* Returns 0 on success .
*/
int proc_do_large_bitmap ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2010-05-05 04:26:55 +04:00
{
int err = 0 ;
size_t left = * lenp ;
unsigned long bitmap_len = table - > maxlen ;
2014-05-13 03:04:53 +04:00
unsigned long * bitmap = * ( unsigned long * * ) table - > data ;
2010-05-05 04:26:55 +04:00
unsigned long * tmp_bitmap = NULL ;
char tr_a [ ] = { ' - ' , ' , ' , ' \n ' } , tr_b [ ] = { ' , ' , ' \n ' , 0 } , c ;
2014-05-13 03:04:53 +04:00
if ( ! bitmap | | ! bitmap_len | | ! left | | ( * ppos & & ! write ) ) {
2010-05-05 04:26:55 +04:00
* lenp = 0 ;
return 0 ;
}
if ( write ) {
2020-04-24 09:43:38 +03:00
char * p = buffer ;
2019-05-15 01:45:13 +03:00
size_t skipped = 0 ;
2010-05-05 04:26:55 +04:00
2019-05-15 01:45:13 +03:00
if ( left > PAGE_SIZE - 1 ) {
2010-05-05 04:26:55 +04:00
left = PAGE_SIZE - 1 ;
2019-05-15 01:45:13 +03:00
/* How much of the buffer we'll skip this pass */
skipped = * lenp - left ;
}
2010-05-05 04:26:55 +04:00
2019-05-15 01:44:52 +03:00
tmp_bitmap = bitmap_zalloc ( bitmap_len , GFP_KERNEL ) ;
2020-04-24 09:43:38 +03:00
if ( ! tmp_bitmap )
2010-05-05 04:26:55 +04:00
return - ENOMEM ;
2015-12-24 08:13:10 +03:00
proc_skip_char ( & p , & left , ' \n ' ) ;
2010-05-05 04:26:55 +04:00
while ( ! err & & left ) {
unsigned long val_a , val_b ;
bool neg ;
2019-05-15 01:45:13 +03:00
size_t saved_left ;
2010-05-05 04:26:55 +04:00
2019-05-15 01:45:13 +03:00
/* In case we stop parsing mid-number, we can reset */
saved_left = left ;
2015-12-24 08:13:10 +03:00
err = proc_get_long ( & p , & left , & val_a , & neg , tr_a ,
2010-05-05 04:26:55 +04:00
sizeof ( tr_a ) , & c ) ;
2019-05-15 01:45:13 +03:00
/*
* If we consumed the entirety of a truncated buffer or
* only one char is left ( may be a " - " ) , then stop here ,
* reset , & come back for more .
*/
if ( ( left < = 1 ) & & skipped ) {
left = saved_left ;
break ;
}
2010-05-05 04:26:55 +04:00
if ( err )
break ;
if ( val_a > = bitmap_len | | neg ) {
err = - EINVAL ;
break ;
}
val_b = val_a ;
if ( left ) {
2015-12-24 08:13:10 +03:00
p + + ;
2010-05-05 04:26:55 +04:00
left - - ;
}
if ( c = = ' - ' ) {
2015-12-24 08:13:10 +03:00
err = proc_get_long ( & p , & left , & val_b ,
2010-05-05 04:26:55 +04:00
& neg , tr_b , sizeof ( tr_b ) ,
& c ) ;
2019-05-15 01:45:13 +03:00
/*
* If we consumed all of a truncated buffer , then stop
* here , reset , & come back for more .
*/
if ( ! left & & skipped ) {
left = saved_left ;
break ;
}
2010-05-05 04:26:55 +04:00
if ( err )
break ;
if ( val_b > = bitmap_len | | neg | |
val_a > val_b ) {
err = - EINVAL ;
break ;
}
if ( left ) {
2015-12-24 08:13:10 +03:00
p + + ;
2010-05-05 04:26:55 +04:00
left - - ;
}
}
2012-03-29 01:42:50 +04:00
bitmap_set ( tmp_bitmap , val_a , val_b - val_a + 1 ) ;
2015-12-24 08:13:10 +03:00
proc_skip_char ( & p , & left , ' \n ' ) ;
2010-05-05 04:26:55 +04:00
}
2019-05-15 01:45:13 +03:00
left + = skipped ;
2010-05-05 04:26:55 +04:00
} else {
unsigned long bit_a , bit_b = 0 ;
2021-07-01 04:54:53 +03:00
bool first = 1 ;
2010-05-05 04:26:55 +04:00
while ( left ) {
bit_a = find_next_bit ( bitmap , bitmap_len , bit_b ) ;
if ( bit_a > = bitmap_len )
break ;
bit_b = find_next_zero_bit ( bitmap , bitmap_len ,
bit_a + 1 ) - 1 ;
2020-04-24 09:43:38 +03:00
if ( ! first )
proc_put_char ( & buffer , & left , ' , ' ) ;
proc_put_long ( & buffer , & left , bit_a , false ) ;
2010-05-05 04:26:55 +04:00
if ( bit_a ! = bit_b ) {
2020-04-24 09:43:38 +03:00
proc_put_char ( & buffer , & left , ' - ' ) ;
proc_put_long ( & buffer , & left , bit_b , false ) ;
2010-05-05 04:26:55 +04:00
}
first = 0 ; bit_b + + ;
}
2020-04-24 09:43:38 +03:00
proc_put_char ( & buffer , & left , ' \n ' ) ;
2010-05-05 04:26:55 +04:00
}
if ( ! err ) {
if ( write ) {
if ( * ppos )
bitmap_or ( bitmap , bitmap , tmp_bitmap , bitmap_len ) ;
else
2012-03-29 01:42:50 +04:00
bitmap_copy ( bitmap , tmp_bitmap , bitmap_len ) ;
2010-05-05 04:26:55 +04:00
}
* lenp - = left ;
* ppos + = * lenp ;
}
2017-11-18 02:30:26 +03:00
2019-05-15 01:44:52 +03:00
bitmap_free ( tmp_bitmap ) ;
2017-11-18 02:30:26 +03:00
return err ;
2010-05-05 04:26:55 +04:00
}
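The comma separated range format produced by the read path of
proc_do_large_bitmap can be sketched in userspace. This is a minimal model,
not the kernel code: format_bit_ranges() is a hypothetical name, the bitmap
is limited to a single unsigned long, and snprintf() stands in for the
kernel's output helpers. As in the read loop above, a single set bit is
printed as a plain number rather than as a range.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace model: print the set bits of a small bitmap
 * (bit i of bits is bit number i) as comma-separated ranges into out.
 * Returns the number of characters written, excluding the terminator.
 * Output stops at the last range that fits completely.
 */
size_t format_bit_ranges(unsigned long bits, unsigned int nbits,
			 char *out, size_t outlen)
{
	size_t used = 0;
	unsigned int a = 0;

	assert(outlen > 0);
	assert(nbits <= sizeof(bits) * CHAR_BIT);
	out[0] = '\0';

	while (a < nbits) {
		unsigned int b;
		int n;

		if (!((bits >> a) & 1UL)) {
			a++;
			continue;
		}
		/* Extend b to the last consecutive set bit. */
		b = a;
		while (b + 1 < nbits && ((bits >> (b + 1)) & 1UL))
			b++;

		if (a == b)
			n = snprintf(out + used, outlen - used, "%s%u",
				     used ? "," : "", a);
		else
			n = snprintf(out + used, outlen - used, "%s%u-%u",
				     used ? "," : "", a, b);
		if (n < 0 || (size_t)n >= outlen - used) {
			out[used] = '\0';	/* drop the partial range */
			break;
		}
		used += (size_t)n;
		a = b + 1;
	}
	return used;
}
```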
2011-01-13 04:00:45 +03:00
# else /* CONFIG_PROC_SYSCTL */
2005-04-17 02:20:36 +04:00
2009-09-24 02:57:19 +04:00
int proc_dostring ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
return - ENOSYS ;
}
2021-08-03 13:59:36 +03:00
int proc_dobool ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp , loff_t * ppos )
{
return - ENOSYS ;
}
2020-04-24 09:43:37 +03:00
int proc_dointvec ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2020-04-24 09:43:37 +03:00
{
return - ENOSYS ;
}
int proc_douintvec ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2020-04-24 09:43:37 +03:00
{
return - ENOSYS ;
}
int proc_dointvec_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2020-04-24 09:43:37 +03:00
{
return - ENOSYS ;
}
int proc_douintvec_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2020-04-24 09:43:37 +03:00
{
return - ENOSYS ;
2010-05-05 04:26:55 +04:00
}
2021-03-25 21:08:13 +03:00
int proc_dou8vec_minmax ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp , loff_t * ppos )
{
return - ENOSYS ;
}
2020-04-24 09:43:37 +03:00
int proc_dointvec_jiffies ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2020-04-24 09:43:37 +03:00
{
return - ENOSYS ;
}
2005-04-17 02:20:36 +04:00
2020-04-24 09:43:37 +03:00
int proc_dointvec_userhz_jiffies ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
return - ENOSYS ;
}
2020-04-24 09:43:37 +03:00
int proc_dointvec_ms_jiffies ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
return - ENOSYS ;
}
2020-04-24 09:43:37 +03:00
int proc_doulongvec_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2016-08-26 01:16:51 +03:00
{
return - ENOSYS ;
}
2020-04-24 09:43:37 +03:00
int proc_doulongvec_ms_jiffies_minmax ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2020-04-24 09:43:38 +03:00
return - ENOSYS ;
2005-04-17 02:20:36 +04:00
}
2020-04-24 09:43:37 +03:00
int proc_do_large_bitmap ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
return - ENOSYS ;
}
2020-04-24 09:43:37 +03:00
# endif /* CONFIG_PROC_SYSCTL */
# if defined(CONFIG_SYSCTL)
int proc_do_static_key ( struct ctl_table * table , int write ,
2020-04-24 09:43:38 +03:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-04-17 02:20:36 +04:00
{
2020-04-24 09:43:37 +03:00
struct static_key * key = ( struct static_key * ) table - > data ;
static DEFINE_MUTEX ( static_key_mutex ) ;
int val , ret ;
struct ctl_table tmp = {
. data = & val ,
. maxlen = sizeof ( val ) ,
. mode = table - > mode ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ;
if ( write & & ! capable ( CAP_SYS_ADMIN ) )
return - EPERM ;
mutex_lock ( & static_key_mutex ) ;
val = static_key_enabled ( key ) ;
ret = proc_dointvec_minmax ( & tmp , write , buffer , lenp , ppos ) ;
if ( write & & ! ret ) {
if ( val )
static_key_enable ( key ) ;
else
static_key_disable ( key ) ;
}
mutex_unlock ( & static_key_mutex ) ;
return ret ;
2005-04-17 02:20:36 +04:00
}
2020-04-24 09:43:37 +03:00
static struct ctl_table kern_table [ ] = {
2021-03-24 16:39:16 +03:00
# ifdef CONFIG_NUMA_BALANCING
2020-04-24 09:43:37 +03:00
{
. procname = " numa_balancing " ,
. data = NULL , /* filled in by handler */
. maxlen = sizeof ( unsigned int ) ,
. mode = 0644 ,
. proc_handler = sysctl_numa_balancing ,
. extra1 = SYSCTL_ZERO ,
NUMA balancing: optimize page placement for memory tiering system
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called a memory tiering system,
because the performance of the different types of memory is usually
different.
In such a system, because memory access patterns change over time,
some pages in the slow memory may become globally hot. So in this
patch, the NUMA balancing mechanism is enhanced to optimize page
placement among the different memory types dynamically, according to
how hot or cold the pages are.
In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.
The original NUMA balancing mechanism stops migrating pages if the
free memory of the target node falls below the high watermark. This
is a reasonable policy if there's only one memory type. But it makes
the original NUMA balancing mechanism nearly useless for optimizing
page placement among different memory types. Details are as follows.
It's the common case that the working-set size of the workload is
larger than the size of the fast memory nodes. Otherwise, it's
unnecessary to use the slow memory at all. So there are almost never
enough free pages in the fast memory nodes, and the globally hot
pages in the slow memory node cannot be promoted to the fast memory
node. To solve the issue, we have 2 choices as follows,
a. Ignore the free pages watermark checking when promoting hot pages
from the slow memory node to the fast memory node. This will
create some memory pressure in the fast memory node and thus trigger
memory reclaim, so that the cold pages in the fast memory
node will be demoted to the slow memory node.
b. Define a new watermark called wmark_promo which is higher than
wmark_high, and have kswapd reclaim pages until free pages reach
that watermark. The scenario is as follows: when we want to promote
hot pages from slow memory to fast memory, but the fast memory's free
pages would drop below the high watermark with such a promotion, we wake
up kswapd with the wmark_promo watermark in order to demote cold pages and
free up some space. So the next time we want to promote hot pages, we
might have a chance of doing so.
Choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is high, the memory pressure
may become so high that the memory allocation latency of the workload
is affected, e.g. direct reclaim may be triggered.
Choice "b" works much better in this respect. If the memory
pressure of the workload is high, hot page promotion will stop
earlier because its allocation watermark is higher than that of
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added, which is larger than the
high watermark and can be controlled via watermark_scale_factor.
In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to optimize page
placement according to hot/cold among different memory types. So the
sysctl user space interface (numa_balancing) is extended in a backward
compatible way as follows, so that users can enable/disable these
functionalities individually.
The sysctl is converted from a Boolean value to a bit field. The
flags are defined as:
- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING
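Treating the sysctl value as a bit field means that the individual features
are tested with a mask rather than compared for equality. The sketch below
is illustrative only: the three values are copied from the list above,
numa_mode_has() is a hypothetical helper, and the reading that a value of 3
would enable both features is an inference from the "bit field" wording,
not something this message states.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the flag list above. */
#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

/* Hypothetical helper: is the given feature bit set in mode? */
bool numa_mode_has(unsigned int mode, unsigned int flag)
{
	assert(flag != 0);	/* DISABLED is the absence of bits, not a bit */
	return (mode & flag) != 0;
}
```

A value of 1 keeps the old Boolean meaning, which is what makes the
extension backward compatible.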
We have tested the patch with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent
Memory Model. The test results show that the pmbench score can
improve by up to 95.9%.
Thanks to Andrew Morton for helping fix the document format error.
Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23 00:46:23 +03:00
. extra2 = SYSCTL_FOUR ,
2020-04-24 09:43:37 +03:00
} ,
# endif /* CONFIG_NUMA_BALANCING */
{
. procname = " panic " ,
. data = & panic_timeout ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# ifdef CONFIG_PROC_SYSCTL
{
. procname = " tainted " ,
. maxlen = sizeof ( long ) ,
. mode = 0644 ,
. proc_handler = proc_taint ,
} ,
{
. procname = " sysctl_writes_strict " ,
. data = & sysctl_writes_strict ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
2022-01-22 09:10:55 +03:00
. extra1 = SYSCTL_NEG_ONE ,
2020-04-24 09:43:37 +03:00
. extra2 = SYSCTL_ONE ,
} ,
# endif
{
. procname = " print-fatal-signals " ,
. data = & print_fatal_signals ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# ifdef CONFIG_SPARC
{
. procname = " reboot-cmd " ,
. data = reboot_command ,
. maxlen = 256 ,
. mode = 0644 ,
. proc_handler = proc_dostring ,
} ,
{
. procname = " stop-a " ,
. data = & stop_a_enabled ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " scons-poweroff " ,
. data = & scons_pwroff ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# ifdef CONFIG_SPARC64
{
. procname = " tsb-ratio " ,
. data = & sysctl_tsb_ratio ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# ifdef CONFIG_PARISC
{
. procname = " soft-power " ,
. data = & pwrsw_enabled ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# ifdef CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW
{
. procname = " unaligned-trap " ,
. data = & unaligned_enabled ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# ifdef CONFIG_STACK_TRACER
{
. procname = " stack_tracer_enabled " ,
. data = & stack_tracer_enabled ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = stack_trace_sysctl ,
} ,
# endif
# ifdef CONFIG_TRACING
{
. procname = " ftrace_dump_on_oops " ,
. data = & ftrace_dump_on_oops ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " traceoff_on_warning " ,
. data = & __disable_trace_on_warning ,
. maxlen = sizeof ( __disable_trace_on_warning ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " tracepoint_printk " ,
. data = & tracepoint_printk ,
. maxlen = sizeof ( tracepoint_printk ) ,
. mode = 0644 ,
. proc_handler = tracepoint_printk_sysctl ,
} ,
# endif
# ifdef CONFIG_MODULES
{
. procname = " modprobe " ,
. data = & modprobe_path ,
. maxlen = KMOD_PATH_LEN ,
. mode = 0644 ,
. proc_handler = proc_dostring ,
} ,
{
. procname = " modules_disabled " ,
. data = & modules_disabled ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
/* only handle a transition from default "0" to "1" */
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ONE ,
. extra2 = SYSCTL_ONE ,
} ,
# endif
# ifdef CONFIG_UEVENT_HELPER
{
. procname = " hotplug " ,
. data = & uevent_helper ,
. maxlen = UEVENT_HELPER_PATH_LEN ,
. mode = 0644 ,
. proc_handler = proc_dostring ,
} ,
# endif
# ifdef CONFIG_MAGIC_SYSRQ
{
. procname = " sysrq " ,
. data = NULL ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = sysrq_sysctl_handler ,
} ,
# endif
# ifdef CONFIG_PROC_SYSCTL
{
. procname = " cad_pid " ,
. data = NULL ,
. maxlen = sizeof ( int ) ,
. mode = 0600 ,
. proc_handler = proc_do_cad_pid ,
} ,
# endif
{
. procname = " threads-max " ,
. data = NULL ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = sysctl_max_threads ,
} ,
{
. procname = " usermodehelper " ,
. mode = 0555 ,
. child = usermodehelper_table ,
} ,
{
. procname = " overflowuid " ,
. data = & overflowuid ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
2022-01-22 09:11:19 +03:00
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:13:03 +03:00
. extra2 = SYSCTL_MAXOLDUID ,
2020-04-24 09:43:37 +03:00
} ,
{
. procname = " overflowgid " ,
. data = & overflowgid ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
2022-01-22 09:11:19 +03:00
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:13:03 +03:00
. extra2 = SYSCTL_MAXOLDUID ,
2020-04-24 09:43:37 +03:00
} ,
# ifdef CONFIG_S390
{
. procname = " userprocess_debug " ,
. data = & show_unhandled_signals ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
{
. procname = " pid_max " ,
. data = & pid_max ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & pid_max_min ,
. extra2 = & pid_max_max ,
} ,
{
. procname = " panic_on_oops " ,
. data = & panic_on_oops ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " panic_print " ,
. data = & panic_print ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
{
. procname = " ngroups_max " ,
2022-01-22 09:11:09 +03:00
. data = ( void * ) & ngroups_max ,
2020-04-24 09:43:37 +03:00
. maxlen = sizeof ( int ) ,
. mode = 0444 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " cap_last_cap " ,
. data = ( void * ) & cap_last_cap ,
. maxlen = sizeof ( int ) ,
. mode = 0444 ,
. proc_handler = proc_dointvec ,
} ,
# if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
{
. procname = " unknown_nmi_panic " ,
. data = & unknown_nmi_panic ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
2017-07-13 00:33:40 +03:00
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.
6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.
21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethtool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
2020-06-04 02:27:18 +03:00
# if (defined(CONFIG_X86_32) || defined(CONFIG_PARISC)) && \
defined ( CONFIG_DEBUG_STACKOVERFLOW )
2020-04-24 09:43:37 +03:00
{
2020-06-04 02:27:18 +03:00
. procname = " panic_on_stackoverflow " ,
. data = & sysctl_panic_on_stackoverflow ,
2020-04-24 09:43:37 +03:00
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2020-06-04 02:27:18 +03:00
# endif
# if defined(CONFIG_X86)
2020-04-24 09:43:37 +03:00
{
2020-06-04 02:27:18 +03:00
. procname = " panic_on_unrecovered_nmi " ,
. data = & panic_on_unrecovered_nmi ,
2020-04-24 09:43:37 +03:00
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
2020-06-04 02:27:18 +03:00
. procname = " panic_on_io_nmi " ,
. data = & panic_on_io_nmi ,
2020-04-24 09:43:37 +03:00
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " bootloader_type " ,
. data = & bootloader_type ,
. maxlen = sizeof ( int ) ,
. mode = 0444 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " bootloader_version " ,
. data = & bootloader_version ,
. maxlen = sizeof ( int ) ,
. mode = 0444 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " io_delay_type " ,
. data = & io_delay_type ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# if defined(CONFIG_MMU)
{
. procname = " randomize_va_space " ,
. data = & randomize_va_space ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# if defined(CONFIG_S390) && defined(CONFIG_SMP)
{
. procname = " spin_retry " ,
. data = & spin_retry ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# if defined(CONFIG_ACPI_SLEEP) && defined(CONFIG_X86)
{
. procname = " acpi_video_flags " ,
. data = & acpi_realmode_flags ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
# endif
# ifdef CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN
{
. procname = " ignore-unaligned-usertrap " ,
. data = & no_unaligned_warning ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# ifdef CONFIG_IA64
{
. procname = " unaligned-dump-stack " ,
. data = & unaligned_dump_stack ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# ifdef CONFIG_RT_MUTEXES
{
. procname = " max_lock_depth " ,
. data = & max_lock_depth ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
# endif
# ifdef CONFIG_KEYS
{
. procname = " keys " ,
. mode = 0555 ,
. child = key_sysctls ,
} ,
# endif
# ifdef CONFIG_PERF_EVENTS
/*
* User - space scripts rely on the existence of this file
* as a feature check for perf_events being enabled .
*
* So it ' s an ABI , do not remove !
*/
{
. procname = " perf_event_paranoid " ,
. data = & sysctl_perf_event_paranoid ,
. maxlen = sizeof ( sysctl_perf_event_paranoid ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " perf_event_mlock_kb " ,
. data = & sysctl_perf_event_mlock ,
. maxlen = sizeof ( sysctl_perf_event_mlock ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " perf_event_max_sample_rate " ,
. data = & sysctl_perf_event_sample_rate ,
. maxlen = sizeof ( sysctl_perf_event_sample_rate ) ,
. mode = 0644 ,
. proc_handler = perf_proc_update_handler ,
. extra1 = SYSCTL_ONE ,
} ,
{
. procname = " perf_cpu_time_max_percent " ,
. data = & sysctl_perf_cpu_time_max_percent ,
. maxlen = sizeof ( sysctl_perf_cpu_time_max_percent ) ,
. mode = 0644 ,
. proc_handler = perf_cpu_time_max_percent_handler ,
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_ONE_HUNDRED ,
2020-04-24 09:43:37 +03:00
} ,
{
. procname = " perf_event_max_stack " ,
. data = & sysctl_perf_event_max_stack ,
. maxlen = sizeof ( sysctl_perf_event_max_stack ) ,
. mode = 0644 ,
. proc_handler = perf_event_max_stack_handler ,
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:11:14 +03:00
. extra2 = ( void * ) & six_hundred_forty_kb ,
2020-04-24 09:43:37 +03:00
} ,
{
. procname = " perf_event_max_contexts_per_stack " ,
. data = & sysctl_perf_event_max_contexts_per_stack ,
. maxlen = sizeof ( sysctl_perf_event_max_contexts_per_stack ) ,
. mode = 0644 ,
. proc_handler = perf_event_max_stack_handler ,
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_ONE_THOUSAND ,
2020-04-24 09:43:37 +03:00
} ,
# endif
{
. procname = " panic_on_warn " ,
. data = & panic_on_warn ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
# if defined(CONFIG_TREE_RCU)
{
. procname = " panic_on_rcu_stall " ,
. data = & sysctl_panic_on_rcu_stall ,
. maxlen = sizeof ( sysctl_panic_on_rcu_stall ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
# endif
2020-08-31 09:41:17 +03:00
# if defined(CONFIG_TREE_RCU)
{
. procname = " max_rcu_stall_to_panic " ,
. data = & sysctl_max_rcu_stall_to_panic ,
. maxlen = sizeof ( sysctl_max_rcu_stall_to_panic ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ONE ,
. extra2 = SYSCTL_INT_MAX ,
} ,
2020-04-24 09:43:37 +03:00
# endif
{ }
} ;
2005-04-17 02:20:36 +04:00
2020-04-24 09:43:37 +03:00
static struct ctl_table vm_table [ ] = {
{
. procname = " overcommit_memory " ,
. data = & sysctl_overcommit_memory ,
. maxlen = sizeof ( sysctl_overcommit_memory ) ,
. mode = 0644 ,
2020-08-07 09:23:15 +03:00
. proc_handler = overcommit_policy_handler ,
2020-04-24 09:43:37 +03:00
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_TWO ,
2020-04-24 09:43:37 +03:00
} ,
{
. procname = " overcommit_ratio " ,
. data = & sysctl_overcommit_ratio ,
. maxlen = sizeof ( sysctl_overcommit_ratio ) ,
. mode = 0644 ,
. proc_handler = overcommit_ratio_handler ,
} ,
{
. procname = " overcommit_kbytes " ,
. data = & sysctl_overcommit_kbytes ,
. maxlen = sizeof ( sysctl_overcommit_kbytes ) ,
. mode = 0644 ,
. proc_handler = overcommit_kbytes_handler ,
} ,
{
. procname = " page-cluster " ,
. data = & page_cluster ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
} ,
{
. procname = " dirtytime_expire_seconds " ,
. data = & dirtytime_expire_interval ,
. maxlen = sizeof ( dirtytime_expire_interval ) ,
. mode = 0644 ,
. proc_handler = dirtytime_interval_handler ,
. extra1 = SYSCTL_ZERO ,
} ,
{
. procname = " swappiness " ,
. data = & vm_swappiness ,
. maxlen = sizeof ( vm_swappiness ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_TWO_HUNDRED ,
2020-04-24 09:43:37 +03:00
} ,
2022-06-09 13:40:32 +03:00
# ifdef CONFIG_NUMA
{
. procname = " numa_stat " ,
. data = & sysctl_vm_numa_stat ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = sysctl_vm_numa_stat_handler ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
# endif
2020-04-24 09:43:37 +03:00
# ifdef CONFIG_HUGETLB_PAGE
{
. procname = " nr_hugepages " ,
. data = NULL ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = hugetlb_sysctl_handler ,
} ,
# ifdef CONFIG_NUMA
{
. procname = " nr_hugepages_mempolicy " ,
. data = NULL ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = & hugetlb_mempolicy_sysctl_handler ,
} ,
# endif
{
. procname = " hugetlb_shm_group " ,
. data = & sysctl_hugetlb_shm_group ,
. maxlen = sizeof ( gid_t ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " nr_overcommit_hugepages " ,
. data = NULL ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = hugetlb_overcommit_handler ,
} ,
# endif
{
. procname = " lowmem_reserve_ratio " ,
. data = & sysctl_lowmem_reserve_ratio ,
. maxlen = sizeof ( sysctl_lowmem_reserve_ratio ) ,
. mode = 0644 ,
. proc_handler = lowmem_reserve_ratio_sysctl_handler ,
} ,
{
. procname = " drop_caches " ,
. data = & sysctl_drop_caches ,
. maxlen = sizeof ( int ) ,
. mode = 0200 ,
. proc_handler = drop_caches_sysctl_handler ,
. extra1 = SYSCTL_ONE ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_FOUR ,
2020-04-24 09:43:37 +03:00
} ,
# ifdef CONFIG_COMPACTION
{
. procname = " compact_memory " ,
2021-05-05 04:36:48 +03:00
. data = NULL ,
2020-04-24 09:43:37 +03:00
. maxlen = sizeof ( int ) ,
. mode = 0200 ,
. proc_handler = sysctl_compaction_handler ,
} ,
mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
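The translation described above can be sketched in plain C. This is an illustrative userspace sketch, not the kernel's implementation: the helper names frag_wmark_low(), frag_wmark_high() and proactive_should_start() are hypothetical, and clamping the high threshold to 100 is an assumption made here so that the result stays a valid score when proactiveness is below 10.

```c
/* Hypothetical helper: low threshold, low = 100 - proactiveness.
 * Assumes proactiveness is already limited to the range [0, 100]. */
static unsigned int frag_wmark_low(unsigned int proactiveness)
{
	return 100U - proactiveness;
}

/* Hypothetical helper: high threshold, high = low + 10.
 * Assumption: clamp to 100 so the threshold remains a valid score. */
static unsigned int frag_wmark_high(unsigned int proactiveness)
{
	unsigned int high = frag_wmark_low(proactiveness) + 10U;

	return high > 100U ? 100U : high;
}

/* Hypothetical helper: returns 1 when a node's fragmentation score
 * exceeds the high threshold, the condition under which the text says
 * proactive compaction is started; returns 0 otherwise. */
static int proactive_should_start(unsigned int score,
				  unsigned int proactiveness)
{
	return score > frag_wmark_high(proactiveness);
}
```

With the default proactiveness of 20, these helpers give low=80 and high=90, matching the values quoted above; a node with a score of 95 would start proactive compaction, while a node with a score of 85 would not.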
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
(all latency values are in microseconds)
- With vanilla 5.6.0-rc3
percentile latency
–––––––––– –––––––
5 7894
10 9496
25 12561
30 15295
40 18244
50 21229
60 27556
75 30147
80 31047
90 32859
95 33799
Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 04:31:00 +03:00
{
. procname = " compaction_proactiveness " ,
. data = & sysctl_compaction_proactiveness ,
2020-08-12 04:31:07 +03:00
. maxlen = sizeof ( sysctl_compaction_proactiveness ) ,
mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
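The translation and the score described above can be sketched in plain C as follows. This is a minimal userspace illustration, not the kernel's implementation: `struct zone_info`, `low_threshold()`, `high_threshold()` and `node_score()` are hypothetical names, "high=low+10%" is read as ten score points, and capping the high threshold at 100 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone summary; field names are illustrative only. */
struct zone_info {
	unsigned long present_pages;	/* weight of this zone */
	unsigned int extfrag;		/* external fragmentation, 0..100 */
};

/* low = 100 - proactiveness, as stated in the text. */
static unsigned int low_threshold(unsigned int proactiveness)
{
	assert(proactiveness <= 100);
	return 100 - proactiveness;
}

/* high = low + 10, capped at 100 (the cap is an assumption). */
static unsigned int high_threshold(unsigned int proactiveness)
{
	unsigned int high = low_threshold(proactiveness) + 10;

	return high > 100 ? 100 : high;
}

/* Node score: mean of per-zone extfrag, weighted by present_pages. */
static unsigned int node_score(const struct zone_info *zones, size_t nr)
{
	unsigned long long weighted = 0, total = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		weighted += (unsigned long long)zones[i].present_pages *
			    zones[i].extfrag;
		total += zones[i].present_pages;
	}
	/* A node with no present pages is treated as unfragmented. */
	return total ? (unsigned int)(weighted / total) : 0;
}
```

For a proactiveness of 20, `low_threshold()` yields 80 and `high_threshold()` yields 90. A node with one zone of 1000 pages at extfrag 90 and another of 3000 pages at extfrag 50 scores 60.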
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
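The periodic check described above behaves like a hysteresis loop: compaction starts above the high threshold and stops at the low one. The following is a rough sketch under that reading only; `struct node_state` and `proactive_check()` are hypothetical names and do not mirror the kcompactd code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node state for this sketch. */
struct node_state {
	unsigned int score;	/* current fragmentation score, 0..100 */
	bool compacting;	/* true while a proactive round is running */
};

/*
 * One periodic check. Starts a round when the score exceeds `high`,
 * ends it once the score has dropped to `low` or below, and returns
 * whether compaction should run after this check.
 */
static bool proactive_check(struct node_state *n, unsigned int low,
			    unsigned int high)
{
	assert(low <= high);

	if (!n->compacting && n->score > high)
		n->compacting = true;
	else if (n->compacting && n->score <= low)
		n->compacting = false;
	return n->compacting;
}
```

With low=80 and high=90, a score of 85 does not start a round, but once a round has started at 91 it keeps running at 85 and ends only when the score reaches 80.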
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
(all latency values are in microseconds)
- With vanilla 5.6.0-rc3
percentile latency
–––––––––– –––––––
5 7894
10 9496
25 12561
30 15295
40 18244
50 21229
60 27556
75 30147
80 31047
90 32859
95 33799
Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while they try to bring a node's score within thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
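The "~30 seconds" figure is the number of skipped checks multiplied by the check interval. The sketch below works through that arithmetic. It assumes a maximum deferral of 1 << 6 = 64 checks (believed to match COMPACT_MAX_DEFER_SHIFT in kernels of this era, but stated here as an assumption) and the 500 ms wake-up interval given earlier in this commit message. `backoff_msec()` and both macros are hypothetical names.

```c
#include <assert.h>

#define CHECK_INTERVAL_MSEC	500u	/* wake-up interval from the text */
#define MAX_DEFER_SHIFT		6u	/* assumption: 1 << 6 = 64 checks */

/* Milliseconds that pass while (1 << defer_shift) checks are skipped. */
static unsigned int backoff_msec(unsigned int defer_shift)
{
	assert(defer_shift < 16);	/* keep the product well inside 32 bits */
	return (1u << defer_shift) * CHECK_INTERVAL_MSEC;
}
```

Under these assumptions `backoff_msec(MAX_DEFER_SHIFT)` is 64 * 500 = 32000 ms, which is consistent with the roughly 30 seconds quoted above.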
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 04:31:00 +03:00
. mode = 0644 ,
2021-09-03 00:59:59 +03:00
. proc_handler = compaction_proactiveness_sysctl_handler ,
2020-08-12 04:31:00 +03:00
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_ONE_HUNDRED ,
2020-08-12 04:31:00 +03:00
} ,
2020-04-24 09:43:37 +03:00
{
. procname = " extfrag_threshold " ,
. data = & sysctl_extfrag_threshold ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
2022-01-22 09:11:19 +03:00
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:11:14 +03:00
. extra2 = ( void * ) & max_extfrag_threshold ,
2020-04-24 09:43:37 +03:00
} ,
{
. procname = " compact_unevictable_allowed " ,
. data = & sysctl_compact_unevictable_allowed ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax_warn_RT_change ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
2005-04-17 02:20:36 +04:00
2020-04-24 09:43:37 +03:00
# endif /* CONFIG_COMPACTION */
{
. procname = " min_free_kbytes " ,
. data = & min_free_kbytes ,
. maxlen = sizeof ( min_free_kbytes ) ,
. mode = 0644 ,
. proc_handler = min_free_kbytes_sysctl_handler ,
. extra1 = SYSCTL_ZERO ,
} ,
{
. procname = " watermark_boost_factor " ,
. data = & watermark_boost_factor ,
. maxlen = sizeof ( watermark_boost_factor ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
} ,
{
. procname = " watermark_scale_factor " ,
. data = & watermark_scale_factor ,
. maxlen = sizeof ( watermark_scale_factor ) ,
. mode = 0644 ,
. proc_handler = watermark_scale_factor_sysctl_handler ,
. extra1 = SYSCTL_ONE ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_THREE_THOUSAND ,
2020-04-24 09:43:37 +03:00
} ,
{
2021-06-29 05:42:24 +03:00
. procname = " percpu_pagelist_high_fraction " ,
. data = & percpu_pagelist_high_fraction ,
. maxlen = sizeof ( percpu_pagelist_high_fraction ) ,
2020-04-24 09:43:37 +03:00
. mode = 0644 ,
2021-06-29 05:42:24 +03:00
. proc_handler = percpu_pagelist_high_fraction_sysctl_handler ,
2020-04-24 09:43:37 +03:00
. extra1 = SYSCTL_ZERO ,
} ,
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
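The bounded lock stealing described above can be pictured as a per-waiter budget that starts at the tunable's value. This is a simplified sketch of the idea only; `struct lock_waiter` and `demand_fair_handoff()` are hypothetical names and do not reproduce the kernel's wait-queue code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical waiter state; steals_left starts at the tunable's value. */
struct lock_waiter {
	int steals_left;
};

/*
 * Called each time the waiter wakes up and finds the lock taken again.
 * Returns true once the waiter should stop tolerating lock stealing and
 * insist on strict, in-order handoff.
 */
static bool demand_fair_handoff(struct lock_waiter *w)
{
	assert(w != NULL);

	if (w->steals_left > 0) {
		w->steals_left--;	/* tolerate one more steal */
		return false;
	}
	return true;			/* budget exhausted: go fair */
}
```

With a budget of 5 the waiter tolerates five steals and asks for fair handoff on the sixth attempt. A budget of 0 gives strictly fair behavior from the first attempt.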
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain loads. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-14 00:05:35 +03:00
{
. procname = " page_lock_unfairness " ,
. data = & sysctl_page_lock_unfairness ,
. maxlen = sizeof ( sysctl_page_lock_unfairness ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
} ,
2020-04-24 09:43:37 +03:00
# ifdef CONFIG_MMU
{
. procname = " max_map_count " ,
. data = & sysctl_max_map_count ,
. maxlen = sizeof ( sysctl_max_map_count ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
} ,
# else
{
. procname = " nr_trim_pages " ,
. data = & sysctl_nr_trim_pages ,
. maxlen = sizeof ( sysctl_nr_trim_pages ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
} ,
# endif
{
. procname = " vfs_cache_pressure " ,
. data = & sysctl_vfs_cache_pressure ,
. maxlen = sizeof ( sysctl_vfs_cache_pressure ) ,
. mode = 0644 ,
2021-02-26 04:20:53 +03:00
. proc_handler = proc_dointvec_minmax ,
2020-04-24 09:43:37 +03:00
. extra1 = SYSCTL_ZERO ,
} ,
# if defined(HAVE_ARCH_PICK_MMAP_LAYOUT) || \
defined ( CONFIG_ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT )
{
. procname = " legacy_va_layout " ,
. data = & sysctl_legacy_va_layout ,
. maxlen = sizeof ( sysctl_legacy_va_layout ) ,
. mode = 0644 ,
2021-02-26 04:20:53 +03:00
. proc_handler = proc_dointvec_minmax ,
2020-04-24 09:43:37 +03:00
. extra1 = SYSCTL_ZERO ,
} ,
# endif
# ifdef CONFIG_NUMA
{
. procname = " zone_reclaim_mode " ,
. data = & node_reclaim_mode ,
. maxlen = sizeof ( node_reclaim_mode ) ,
. mode = 0644 ,
2021-02-26 04:20:53 +03:00
. proc_handler = proc_dointvec_minmax ,
2020-04-24 09:43:37 +03:00
. extra1 = SYSCTL_ZERO ,
} ,
{
. procname = " min_unmapped_ratio " ,
. data = & sysctl_min_unmapped_ratio ,
. maxlen = sizeof ( sysctl_min_unmapped_ratio ) ,
. mode = 0644 ,
. proc_handler = sysctl_min_unmapped_ratio_sysctl_handler ,
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_ONE_HUNDRED ,
2020-04-24 09:43:37 +03:00
} ,
{
. procname = " min_slab_ratio " ,
. data = & sysctl_min_slab_ratio ,
. maxlen = sizeof ( sysctl_min_slab_ratio ) ,
. mode = 0644 ,
. proc_handler = sysctl_min_slab_ratio_sysctl_handler ,
. extra1 = SYSCTL_ZERO ,
2022-01-22 09:10:55 +03:00
. extra2 = SYSCTL_ONE_HUNDRED ,
2020-04-24 09:43:37 +03:00
} ,
# endif
# ifdef CONFIG_SMP
{
. procname = " stat_interval " ,
. data = & sysctl_stat_interval ,
. maxlen = sizeof ( sysctl_stat_interval ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_jiffies ,
} ,
{
. procname = " stat_refresh " ,
. data = NULL ,
. maxlen = 0 ,
. mode = 0600 ,
. proc_handler = vmstat_refresh ,
} ,
# endif
# ifdef CONFIG_MMU
{
. procname = " mmap_min_addr " ,
. data = & dac_mmap_min_addr ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = mmap_min_addr_handler ,
} ,
# endif
# ifdef CONFIG_NUMA
{
. procname = " numa_zonelist_order " ,
. data = & numa_zonelist_order ,
. maxlen = NUMA_ZONELIST_ORDER_LEN ,
. mode = 0644 ,
. proc_handler = numa_zonelist_order_handler ,
} ,
# endif
# if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
( defined ( CONFIG_SUPERH ) & & defined ( CONFIG_VSYSCALL ) )
{
. procname = " vdso_enabled " ,
# ifdef CONFIG_X86_32
. data = & vdso32_enabled ,
. maxlen = sizeof ( vdso32_enabled ) ,
# else
. data = & vdso_enabled ,
. maxlen = sizeof ( vdso_enabled ) ,
# endif
. mode = 0644 ,
. proc_handler = proc_dointvec ,
. extra1 = SYSCTL_ZERO ,
} ,
# endif
# ifdef CONFIG_MEMORY_FAILURE
{
. procname = " memory_failure_early_kill " ,
. data = & sysctl_memory_failure_early_kill ,
. maxlen = sizeof ( sysctl_memory_failure_early_kill ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
{
. procname = " memory_failure_recovery " ,
. data = & sysctl_memory_failure_recovery ,
. maxlen = sizeof ( sysctl_memory_failure_recovery ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
# endif
{
. procname = " user_reserve_kbytes " ,
. data = & sysctl_user_reserve_kbytes ,
. maxlen = sizeof ( sysctl_user_reserve_kbytes ) ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
{
. procname = " admin_reserve_kbytes " ,
. data = & sysctl_admin_reserve_kbytes ,
. maxlen = sizeof ( sysctl_admin_reserve_kbytes ) ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
# ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
{
. procname = " mmap_rnd_bits " ,
. data = & mmap_rnd_bits ,
. maxlen = sizeof ( mmap_rnd_bits ) ,
. mode = 0600 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = ( void * ) & mmap_rnd_bits_min ,
. extra2 = ( void * ) & mmap_rnd_bits_max ,
} ,
# endif
# ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS
{
. procname = " mmap_rnd_compat_bits " ,
. data = & mmap_rnd_compat_bits ,
. maxlen = sizeof ( mmap_rnd_compat_bits ) ,
. mode = 0600 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = ( void * ) & mmap_rnd_compat_bits_min ,
. extra2 = ( void * ) & mmap_rnd_compat_bits_max ,
} ,
# endif
# ifdef CONFIG_USERFAULTFD
{
. procname = " unprivileged_userfaultfd " ,
. data = & sysctl_unprivileged_userfaultfd ,
. maxlen = sizeof ( sysctl_unprivileged_userfaultfd ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
# endif
{ }
} ;
2005-04-17 02:20:36 +04:00
2020-04-24 09:43:37 +03:00
static struct ctl_table debug_table [ ] = {
# ifdef CONFIG_SYSCTL_EXCEPTION_TRACE
{
. procname = " exception-trace " ,
. data = & show_unhandled_signals ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
# endif
{ }
} ;
2005-04-17 02:20:36 +04:00
2020-04-24 09:43:37 +03:00
static struct ctl_table dev_table [ ] = {
{ }
} ;
2005-04-17 02:20:36 +04:00
2022-01-22 09:13:24 +03:00
DECLARE_SYSCTL_BASE ( kernel , kern_table ) ;
DECLARE_SYSCTL_BASE ( vm , vm_table ) ;
DECLARE_SYSCTL_BASE ( debug , debug_table ) ;
DECLARE_SYSCTL_BASE ( dev , dev_table ) ;
2005-04-17 02:20:36 +04:00
2022-01-22 09:13:31 +03:00
int __init sysctl_init_bases ( void )
2019-02-26 01:28:39 +03:00
{
2022-01-22 09:13:24 +03:00
register_sysctl_base ( kernel ) ;
register_sysctl_base ( vm ) ;
register_sysctl_base ( debug ) ;
register_sysctl_base ( dev ) ;
2019-02-26 01:28:39 +03:00
2020-04-24 09:43:37 +03:00
return 0 ;
2019-02-26 01:28:39 +03:00
}
2020-04-24 09:43:37 +03:00
# endif /* CONFIG_SYSCTL */
2005-04-17 02:20:36 +04:00
/*
* No sense putting this after each symbol definition , twice ,
* exception granted : - )
*/
2021-08-03 13:59:36 +03:00
EXPORT_SYMBOL ( proc_dobool ) ;
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( proc_dointvec ) ;
2016-08-26 01:16:51 +03:00
EXPORT_SYMBOL ( proc_douintvec ) ;
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( proc_dointvec_jiffies ) ;
EXPORT_SYMBOL ( proc_dointvec_minmax ) ;
2017-07-13 00:33:40 +03:00
EXPORT_SYMBOL_GPL ( proc_douintvec_minmax ) ;
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( proc_dointvec_userhz_jiffies ) ;
EXPORT_SYMBOL ( proc_dointvec_ms_jiffies ) ;
EXPORT_SYMBOL ( proc_dostring ) ;
EXPORT_SYMBOL ( proc_doulongvec_minmax ) ;
EXPORT_SYMBOL ( proc_doulongvec_ms_jiffies_minmax ) ;
2019-04-17 23:35:49 +03:00
EXPORT_SYMBOL ( proc_do_large_bitmap ) ;