// SPDX-License-Identifier: GPL-2.0-only
/*
 *  Copyright (C) 2007
 *
 *  Author: Eric Biederman <ebiederm@xmission.com>
 */

#include <linux/module.h>
#include <linux/ipc.h>
#include <linux/nsproxy.h>
#include <linux/sysctl.h>
#include <linux/uaccess.h>
#include <linux/capability.h>
#include <linux/ipc_namespace.h>
#include <linux/msg.h>
#include <linux/slab.h>
#include <linux/cred.h>
#include "util.h"

static int proc_ipc_dointvec_minmax_orphans(struct ctl_table *table, int write,
		void *buffer, size_t *lenp, loff_t *ppos)
{
	struct ipc_namespace *ns =
		container_of(table->data, struct ipc_namespace, shm_rmid_forced);
	int err;

	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

	if (err < 0)
		return err;
	if (ns->shm_rmid_forced)
		shm_destroy_orphaned(ns);
	return err;
}

/*
 * Stub handler: the value is backed by a local dummy, so reads report 0
 * and writes are range-checked but otherwise discarded.
 */
static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
		void *buffer, size_t *lenp, loff_t *ppos)
{
	struct ctl_table ipc_table;
	int dummy = 0;

	memcpy(&ipc_table, table, sizeof(ipc_table));
	ipc_table.data = &dummy;

	if (write)
		pr_info_once("writing to auto_msgmni has no effect");

	return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
}

static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
	void *buffer, size_t *lenp, loff_t *ppos)
{
	struct ipc_namespace *ns =
		container_of(table->data, struct ipc_namespace, sem_ctls);
	int ret, semmni;

	semmni = ns->sem_ctls[3];
	ret = proc_dointvec(table, write, buffer, lenp, ppos);

	if (!ret)
		ret = sem_check_semmni(ns);

	/*
	 * Reset the semmni value if an error happens.
	 */
	if (ret)
		ns->sem_ctls[3] = semmni;
	return ret;
}

int ipc_mni = IPCMNI;
int ipc_mni_shift = IPCMNI_SHIFT;
int ipc_min_cycle = RADIX_TREE_MAP_SIZE;

static struct ctl_table ipc_sysctls[] = {
	{
		.procname	= "shmmax",
		.data		= &init_ipc_ns.shm_ctlmax,
		.maxlen		= sizeof(init_ipc_ns.shm_ctlmax),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{
		.procname	= "shmall",
		.data		= &init_ipc_ns.shm_ctlall,
		.maxlen		= sizeof(init_ipc_ns.shm_ctlall),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{
		.procname	= "shmmni",
		.data		= &init_ipc_ns.shm_ctlmni,
		.maxlen		= sizeof(init_ipc_ns.shm_ctlmni),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= &ipc_mni,
	},
	{
		.procname	= "shm_rmid_forced",
		.data		= &init_ipc_ns.shm_rmid_forced,
		.maxlen		= sizeof(init_ipc_ns.shm_rmid_forced),
		.mode		= 0644,
		.proc_handler	= proc_ipc_dointvec_minmax_orphans,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{
		.procname	= "msgmax",
		.data		= &init_ipc_ns.msg_ctlmax,
		.maxlen		= sizeof(init_ipc_ns.msg_ctlmax),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_INT_MAX,
	},
	{
		.procname	= "msgmni",
		.data		= &init_ipc_ns.msg_ctlmni,
		.maxlen		= sizeof(init_ipc_ns.msg_ctlmni),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= &ipc_mni,
	},
	{
		.procname	= "auto_msgmni",
		.data		= NULL,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_ipc_auto_msgmni,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{
		.procname	= "msgmnb",
		.data		= &init_ipc_ns.msg_ctlmnb,
		.maxlen		= sizeof(init_ipc_ns.msg_ctlmnb),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_INT_MAX,
	},
	{
		.procname	= "sem",
		.data		= &init_ipc_ns.sem_ctls,
		.maxlen		= 4 * sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_ipc_sem_dointvec,
	},
#ifdef CONFIG_CHECKPOINT_RESTORE
	{
		.procname	= "sem_next_id",
		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
		.mode		= 0444,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_INT_MAX,
	},
	{
		.procname	= "msg_next_id",
		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
		.mode		= 0444,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_INT_MAX,
	},
	{
		.procname	= "shm_next_id",
		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
		.mode		= 0444,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_INT_MAX,
	},
#endif
	{}
};

static struct ctl_table_set *set_lookup(struct ctl_table_root *root)
{
	return &current->nsproxy->ipc_ns->ipc_set;
}

static int set_is_seen(struct ctl_table_set *set)
{
	return &current->nsproxy->ipc_ns->ipc_set == set;
}

static void ipc_set_ownership(struct ctl_table_header *head,
			      struct ctl_table *table,
			      kuid_t *uid, kgid_t *gid)
{
	struct ipc_namespace *ns =
		container_of(head->set, struct ipc_namespace, ipc_set);

	kuid_t ns_root_uid = make_kuid(ns->user_ns, 0);
	kgid_t ns_root_gid = make_kgid(ns->user_ns, 0);

	*uid = uid_valid(ns_root_uid) ? ns_root_uid : GLOBAL_ROOT_UID;
	*gid = gid_valid(ns_root_gid) ? ns_root_gid : GLOBAL_ROOT_GID;
}

static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table)
{
	int mode = table->mode;

#ifdef CONFIG_CHECKPOINT_RESTORE
	struct ipc_namespace *ns =
		container_of(head->set, struct ipc_namespace, ipc_set);

	if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) ||
	     (table->data == &ns->ids[IPC_MSG_IDS].next_id) ||
	     (table->data == &ns->ids[IPC_SHM_IDS].next_id)) &&
	    checkpoint_restore_ns_capable(ns->user_ns))
		mode = 0666;
	else
#endif
	{
		kuid_t ns_root_uid;
		kgid_t ns_root_gid;

		ipc_set_ownership(head, table, &ns_root_uid, &ns_root_gid);

		/* Select the owner, group, or other permission bits. */
		if (uid_eq(current_euid(), ns_root_uid))
			mode >>= 6;

		else if (in_egroup_p(ns_root_gid))
			mode >>= 3;
	}

	mode &= 7;

	/* Replicate the selected rwx triplet into all three positions. */
	return (mode << 6) | (mode << 3) | mode;
}

static struct ctl_table_root set_root = {
	.lookup = set_lookup,
	.permissions = ipc_permissions,
	.set_ownership = ipc_set_ownership,
};

bool setup_ipc_sysctls(struct ipc_namespace *ns)
{
	struct ctl_table *tbl;

	setup_sysctl_set(&ns->ipc_set, &set_root, set_is_seen);

	tbl = kmemdup(ipc_sysctls, sizeof(ipc_sysctls), GFP_KERNEL);
	if (tbl) {
		int i;

		for (i = 0; i < ARRAY_SIZE(ipc_sysctls); i++) {
			if (tbl[i].data == &init_ipc_ns.shm_ctlmax)
				tbl[i].data = &ns->shm_ctlmax;

			else if (tbl[i].data == &init_ipc_ns.shm_ctlall)
				tbl[i].data = &ns->shm_ctlall;

			else if (tbl[i].data == &init_ipc_ns.shm_ctlmni)
				tbl[i].data = &ns->shm_ctlmni;

			else if (tbl[i].data == &init_ipc_ns.shm_rmid_forced)
				tbl[i].data = &ns->shm_rmid_forced;

			else if (tbl[i].data == &init_ipc_ns.msg_ctlmax)
				tbl[i].data = &ns->msg_ctlmax;

			else if (tbl[i].data == &init_ipc_ns.msg_ctlmni)
				tbl[i].data = &ns->msg_ctlmni;

			else if (tbl[i].data == &init_ipc_ns.msg_ctlmnb)
				tbl[i].data = &ns->msg_ctlmnb;

			else if (tbl[i].data == &init_ipc_ns.sem_ctls)
				tbl[i].data = &ns->sem_ctls;
#ifdef CONFIG_CHECKPOINT_RESTORE
			else if (tbl[i].data == &init_ipc_ns.ids[IPC_SEM_IDS].next_id)
				tbl[i].data = &ns->ids[IPC_SEM_IDS].next_id;

			else if (tbl[i].data == &init_ipc_ns.ids[IPC_MSG_IDS].next_id)
				tbl[i].data = &ns->ids[IPC_MSG_IDS].next_id;

			else if (tbl[i].data == &init_ipc_ns.ids[IPC_SHM_IDS].next_id)
				tbl[i].data = &ns->ids[IPC_SHM_IDS].next_id;
#endif
			else
				tbl[i].data = NULL;
		}

		ns->ipc_sysctls = __register_sysctl_table(&ns->ipc_set, "kernel", tbl,
							  ARRAY_SIZE(ipc_sysctls));
	}
	if (!ns->ipc_sysctls) {
		kfree(tbl);
		retire_sysctl_set(&ns->ipc_set);
		return false;
	}

	return true;
}

void retire_ipc_sysctls(struct ipc_namespace *ns)
{
	struct ctl_table *tbl;

	tbl = ns->ipc_sysctls->ctl_table_arg;
	unregister_sysctl_table(ns->ipc_sysctls);
	retire_sysctl_set(&ns->ipc_set);
	kfree(tbl);
}

static int __init ipc_sysctl_init(void)
{
	if (!setup_ipc_sysctls(&init_ipc_ns)) {
		pr_warn("ipc sysctl registration failed\n");
		return -ENOMEM;
	}
	return 0;
}

device_initcall(ipc_sysctl_init);

static int __init ipc_mni_extend(char *str)
{
	ipc_mni = IPCMNI_EXTEND;
	ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
	ipc_min_cycle = IPCMNI_EXTEND_MIN_CYCLE;
	pr_info("IPCMNI extended to %d.\n", ipc_mni);
	return 0;
}
early_param("ipcmni_extend", ipc_mni_extend);