License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
// SPDX-License-Identifier: GPL-2.0
2005-04-16 15:20:36 -07:00
/*
* sysctl_net_ipv4 . c : sysctl interface to net IPV4 subsystem .
*
* Begun April 1 , 1996 , Mike Shaver .
* Added / proc / sys / net / ipv4 directory entry ( empty = ) ) . [ MS ]
*/
# include <linux/sysctl.h>
2007-10-10 17:30:46 -07:00
# include <linux/seqlock.h>
2007-12-05 01:41:26 -08:00
# include <linux/init.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
2005-08-16 02:18:02 -03:00
# include <net/icmp.h>
2005-04-16 15:20:36 -07:00
# include <net/ip.h>
2021-05-17 21:15:18 +03:00
# include <net/ip_fib.h>
2005-04-16 15:20:36 -07:00
# include <net/tcp.h>
2007-12-31 00:29:24 -08:00
# include <net/udp.h>
2006-08-03 16:48:06 -07:00
# include <net/cipso_ipv4.h>
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
# include <net/ping.h>
2017-03-23 13:34:16 -06:00
# include <net/protocol.h>
2017-11-02 17:14:05 +01:00
# include <net/netevent.h>
2005-04-16 15:20:36 -07:00
2007-02-09 23:24:47 +09:00
static int tcp_retr1_max = 255 ;
2005-04-16 15:20:36 -07:00
static int ip_local_port_range_min [ ] = { 1 , 1 } ;
static int ip_local_port_range_max [ ] = { 65535 , 65535 } ;
2010-11-22 12:54:21 +00:00
static int tcp_adv_win_scale_min = - 31 ;
static int tcp_adv_win_scale_max = 31 ;
2023-04-06 14:34:50 +08:00
static int tcp_app_win_max = 31 ;
2019-06-06 09:15:31 -07:00
static int tcp_min_snd_mss_min = TCP_MIN_SND_MSS ;
static int tcp_min_snd_mss_max = 65535 ;
2017-01-20 17:49:11 -08:00
static int ip_privileged_port_min ;
static int ip_privileged_port_max = 65535 ;
2010-12-13 12:16:14 -08:00
static int ip_ttl_min = 1 ;
static int ip_ttl_max = 255 ;
2013-07-19 14:09:01 +02:00
static int tcp_syn_retries_min = 1 ;
static int tcp_syn_retries_max = MAX_TCP_SYNCNT ;
tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.
We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts
This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-09 18:05:58 +00:00
static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT ;
2023-06-01 12:13:05 +09:00
static unsigned long ip_ping_group_range_min [ ] = { 0 , 0 } ;
static unsigned long ip_ping_group_range_max [ ] = { GID_T_MAX , GID_T_MAX } ;
2018-09-25 21:59:28 -07:00
static u32 u32_max_div_HZ = UINT_MAX / HZ ;
2019-04-16 09:47:24 +08:00
static int one_day_secs = 24 * 3600 ;
2021-05-17 21:15:18 +03:00
static u32 fib_multipath_hash_fields_all_mask __maybe_unused =
FIB_MULTIPATH_HASH_FIELD_ALL_MASK ;
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.
thash_entries sockets length Gbps
524288 1 1 50.7
24Mi 48 45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries sockets length Gbps
131072 1 1 50.7
1Mi 8 49.9
2Mi 16 48.9
4Mi 32 47.3
8Mi 64 44.6
16Mi 128 40.6
24Mi 192 36.3
32Mi 256 32.5
40Mi 320 27.0
48Mi 384 25.0
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-07 18:10:22 -07:00
static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024 ;
udp: Introduce optional per-netns hash table.
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries sockets length Gbps
64K 1 1 5.69
1Mi 16 5.27
2Mi 32 4.90
4Mi 64 4.09
8Mi 128 2.96
16Mi 256 2.06
32Mi 512 1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
The sysctl usage is the same with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 13:57:57 -08:00
static unsigned int udp_child_hash_entries_max = UDP_HTABLE_SIZE_MAX ;
2022-10-26 13:51:11 +00:00
static int tcp_plb_max_rounds = 31 ;
static int tcp_plb_max_cong_thresh = 256 ;
2005-04-16 15:20:36 -07:00
2017-07-30 03:57:20 +02:00
/* obsolete */
static int sysctl_tcp_low_latency __read_mostly ;
2007-10-10 17:30:46 -07:00
/* Update system visible IP port range */
2023-12-06 13:44:20 +00:00
static void set_local_port_range ( struct net * net , unsigned int low , unsigned int high )
2007-10-10 17:30:46 -07:00
{
2023-12-06 13:44:20 +00:00
bool same_parity = ! ( ( low ^ high ) & 1 ) ;
2015-05-27 11:34:37 -07:00
if ( same_parity & & ! net - > ipv4 . ip_local_ports . warned ) {
net - > ipv4 . ip_local_ports . warned = true ;
pr_err_ratelimited ( " ip_local_port_range: prefer different parity for start/end values. \n " ) ;
}
2023-12-06 13:44:20 +00:00
WRITE_ONCE ( net - > ipv4 . ip_local_ports . range , high < < 16 | low ) ;
2007-10-10 17:30:46 -07:00
}
/* Validate changes from /proc interface. */
2013-06-11 23:04:25 -07:00
static int ipv4_local_port_range ( struct ctl_table * table , int write ,
2020-04-24 08:43:38 +02:00
void * buffer , size_t * lenp , loff_t * ppos )
2007-10-10 17:30:46 -07:00
{
2023-12-06 13:44:20 +00:00
struct net * net = table - > data ;
2007-10-10 17:30:46 -07:00
int ret ;
2008-10-08 14:18:04 -07:00
int range [ 2 ] ;
2013-06-11 23:04:25 -07:00
struct ctl_table tmp = {
2007-10-10 17:30:46 -07:00
. data = & range ,
. maxlen = sizeof ( range ) ,
. mode = table - > mode ,
. extra1 = & ip_local_port_range_min ,
. extra2 = & ip_local_port_range_max ,
} ;
2013-09-28 14:10:59 -07:00
inet_get_local_port_range ( net , & range [ 0 ] , & range [ 1 ] ) ;
2009-09-23 15:57:19 -07:00
ret = proc_dointvec_minmax ( & tmp , write , buffer , lenp , ppos ) ;
2007-10-10 17:30:46 -07:00
if ( write & & ret = = 0 ) {
2017-01-20 17:49:11 -08:00
/* Ensure that the upper limit is not smaller than the lower,
* and that the lower does not encroach upon the privileged
* port limit .
*/
if ( ( range [ 1 ] < range [ 0 ] ) | |
2022-07-18 10:26:42 -07:00
( range [ 0 ] < READ_ONCE ( net - > ipv4 . sysctl_ip_prot_sock ) ) )
2007-10-10 17:30:46 -07:00
ret = - EINVAL ;
else
2023-12-06 13:44:20 +00:00
set_local_port_range ( net , range [ 0 ] , range [ 1 ] ) ;
2007-10-10 17:30:46 -07:00
}
return ret ;
}
2017-01-20 17:49:11 -08:00
/* Validate changes from /proc interface. */
static int ipv4_privileged_ports ( struct ctl_table * table , int write ,
2020-04-24 08:43:38 +02:00
void * buffer , size_t * lenp , loff_t * ppos )
2017-01-20 17:49:11 -08:00
{
struct net * net = container_of ( table - > data , struct net ,
ipv4 . sysctl_ip_prot_sock ) ;
int ret ;
int pports ;
int range [ 2 ] ;
struct ctl_table tmp = {
. data = & pports ,
. maxlen = sizeof ( pports ) ,
. mode = table - > mode ,
. extra1 = & ip_privileged_port_min ,
. extra2 = & ip_privileged_port_max ,
} ;
2022-07-18 10:26:42 -07:00
pports = READ_ONCE ( net - > ipv4 . sysctl_ip_prot_sock ) ;
2017-01-20 17:49:11 -08:00
ret = proc_dointvec_minmax ( & tmp , write , buffer , lenp , ppos ) ;
if ( write & & ret = = 0 ) {
inet_get_local_port_range ( net , & range [ 0 ] , & range [ 1 ] ) ;
/* Ensure that the local port range doesn't overlap with the
* privileged port range .
*/
if ( range [ 0 ] < pports )
ret = - EINVAL ;
else
2022-07-18 10:26:42 -07:00
WRITE_ONCE ( net - > ipv4 . sysctl_ip_prot_sock , pports ) ;
2017-01-20 17:49:11 -08:00
}
return ret ;
}
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
2012-05-24 10:34:21 -06:00
static void inet_get_ping_group_range_table ( struct ctl_table * table , kgid_t * low , kgid_t * high )
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
{
2012-05-24 10:34:21 -06:00
kgid_t * data = table - > data ;
2013-09-28 14:10:59 -07:00
struct net * net =
2014-05-06 11:02:50 -07:00
container_of ( table - > data , struct net , ipv4 . ping_group_range . range ) ;
2012-04-15 05:58:06 +00:00
unsigned int seq ;
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
do {
2016-10-20 14:19:46 -07:00
seq = read_seqbegin ( & net - > ipv4 . ping_group_range . lock ) ;
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
* low = data [ 0 ] ;
* high = data [ 1 ] ;
2016-10-20 14:19:46 -07:00
} while ( read_seqretry ( & net - > ipv4 . ping_group_range . lock , seq ) ) ;
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
}
/* Update system visible IP port range */
2012-05-24 10:34:21 -06:00
static void set_ping_group_range ( struct ctl_table * table , kgid_t low , kgid_t high )
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
{
2012-05-24 10:34:21 -06:00
kgid_t * data = table - > data ;
2013-09-28 14:10:59 -07:00
struct net * net =
2014-05-06 11:02:50 -07:00
container_of ( table - > data , struct net , ipv4 . ping_group_range . range ) ;
2016-10-20 14:19:46 -07:00
write_seqlock ( & net - > ipv4 . ping_group_range . lock ) ;
2012-05-24 10:34:21 -06:00
data [ 0 ] = low ;
data [ 1 ] = high ;
2016-10-20 14:19:46 -07:00
write_sequnlock ( & net - > ipv4 . ping_group_range . lock ) ;
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
}
/* Validate changes from /proc interface. */
2013-06-11 23:04:25 -07:00
static int ipv4_ping_group_range ( struct ctl_table * table , int write ,
2020-04-24 08:43:38 +02:00
void * buffer , size_t * lenp , loff_t * ppos )
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
{
2012-05-24 10:34:21 -06:00
struct user_namespace * user_ns = current_user_ns ( ) ;
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
int ret ;
2023-06-01 12:13:05 +09:00
unsigned long urange [ 2 ] ;
2012-05-24 10:34:21 -06:00
kgid_t low , high ;
2013-06-11 23:04:25 -07:00
struct ctl_table tmp = {
2012-05-24 10:34:21 -06:00
. data = & urange ,
. maxlen = sizeof ( urange ) ,
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
. mode = table - > mode ,
. extra1 = & ip_ping_group_range_min ,
. extra2 = & ip_ping_group_range_max ,
} ;
2012-05-24 10:34:21 -06:00
inet_get_ping_group_range_table ( table , & low , & high ) ;
urange [ 0 ] = from_kgid_munged ( user_ns , low ) ;
urange [ 1 ] = from_kgid_munged ( user_ns , high ) ;
2023-06-01 12:13:05 +09:00
ret = proc_doulongvec_minmax ( & tmp , write , buffer , lenp , ppos ) ;
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
2012-05-24 10:34:21 -06:00
if ( write & & ret = = 0 ) {
low = make_kgid ( user_ns , urange [ 0 ] ) ;
high = make_kgid ( user_ns , urange [ 1 ] ) ;
2018-07-05 18:49:23 +00:00
if ( ! gid_valid ( low ) | | ! gid_valid ( high ) )
return - EINVAL ;
if ( urange [ 1 ] < urange [ 0 ] | | gid_lt ( high , low ) ) {
2012-05-24 10:34:21 -06:00
low = make_kgid ( & init_user_ns , 1 ) ;
high = make_kgid ( & init_user_ns , 0 ) ;
}
set_ping_group_range ( table , low , high ) ;
}
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
return ret ;
}
2018-08-01 00:36:42 +02:00
static int ipv4_fwd_update_priority ( struct ctl_table * table , int write ,
2020-04-24 08:43:38 +02:00
void * buffer , size_t * lenp , loff_t * ppos )
2018-08-01 00:36:42 +02:00
{
struct net * net ;
int ret ;
net = container_of ( table - > data , struct net ,
ipv4 . sysctl_ip_fwd_update_priority ) ;
2021-03-25 11:08:15 -07:00
ret = proc_dou8vec_minmax ( table , write , buffer , lenp , ppos ) ;
2018-08-01 00:36:42 +02:00
if ( write & & ret = = 0 )
call_netevent_notifiers ( NETEVENT_IPV4_FWD_UPDATE_PRIORITY_UPDATE ,
net ) ;
return ret ;
}
2013-06-11 23:04:25 -07:00
static int proc_tcp_congestion_control ( struct ctl_table * ctl , int write ,
2020-04-24 08:43:38 +02:00
void * buffer , size_t * lenp , loff_t * ppos )
2005-06-23 12:19:55 -07:00
{
2017-11-14 08:25:49 -08:00
struct net * net = container_of ( ctl - > data , struct net ,
ipv4 . tcp_congestion_control ) ;
2005-06-23 12:19:55 -07:00
char val [ TCP_CA_NAME_MAX ] ;
2013-06-11 23:04:25 -07:00
struct ctl_table tbl = {
2005-06-23 12:19:55 -07:00
. data = val ,
. maxlen = TCP_CA_NAME_MAX ,
} ;
int ret ;
2017-11-14 08:25:49 -08:00
tcp_get_default_congestion_control ( net , val ) ;
2005-06-23 12:19:55 -07:00
2009-09-23 15:57:19 -07:00
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
2005-06-23 12:19:55 -07:00
if ( write & & ret = = 0 )
2017-11-14 08:25:49 -08:00
ret = tcp_set_default_congestion_control ( net , val ) ;
2005-06-23 12:19:55 -07:00
return ret ;
}
2013-06-11 23:04:25 -07:00
static int proc_tcp_available_congestion_control ( struct ctl_table * ctl ,
2020-04-24 08:43:38 +02:00
int write , void * buffer ,
size_t * lenp , loff_t * ppos )
2006-11-09 16:32:06 -08:00
{
2013-06-11 23:04:25 -07:00
struct ctl_table tbl = { . maxlen = TCP_CA_BUF_MAX , } ;
2006-11-09 16:32:06 -08:00
int ret ;
tbl . data = kmalloc ( tbl . maxlen , GFP_USER ) ;
if ( ! tbl . data )
return - ENOMEM ;
tcp_get_available_congestion_control ( tbl . data , TCP_CA_BUF_MAX ) ;
2009-09-23 15:57:19 -07:00
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
2006-11-09 16:32:06 -08:00
kfree ( tbl . data ) ;
return ret ;
}
2013-06-11 23:04:25 -07:00
static int proc_allowed_congestion_control ( struct ctl_table * ctl ,
2020-04-24 08:43:38 +02:00
int write , void * buffer ,
size_t * lenp , loff_t * ppos )
2006-11-09 16:35:15 -08:00
{
2013-06-11 23:04:25 -07:00
struct ctl_table tbl = { . maxlen = TCP_CA_BUF_MAX } ;
2006-11-09 16:35:15 -08:00
int ret ;
tbl . data = kmalloc ( tbl . maxlen , GFP_USER ) ;
if ( ! tbl . data )
return - ENOMEM ;
tcp_get_allowed_congestion_control ( tbl . data , tbl . maxlen ) ;
2009-09-23 15:57:19 -07:00
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
2006-11-09 16:35:15 -08:00
if ( write & & ret = = 0 )
ret = tcp_set_allowed_congestion_control ( tbl . data ) ;
kfree ( tbl . data ) ;
return ret ;
}
2019-05-29 12:33:59 -04:00
static int sscanf_key ( char * buf , __le32 * key )
{
u32 user_key [ 4 ] ;
int i , ret = 0 ;
if ( sscanf ( buf , " %x-%x-%x-%x " , user_key , user_key + 1 ,
user_key + 2 , user_key + 3 ) ! = 4 ) {
ret = - EINVAL ;
} else {
for ( i = 0 ; i < ARRAY_SIZE ( user_key ) ; i + + )
key [ i ] = cpu_to_le32 ( user_key [ i ] ) ;
}
pr_debug ( " proc TFO key set 0x%x-%x-%x-%x <- 0x%s: %u \n " ,
user_key [ 0 ] , user_key [ 1 ] , user_key [ 2 ] , user_key [ 3 ] , buf , ret ) ;
return ret ;
}
2017-09-27 11:35:42 +08:00
static int proc_tcp_fastopen_key ( struct ctl_table * table , int write ,
2020-04-24 08:43:38 +02:00
void * buffer , size_t * lenp , loff_t * ppos )
2012-08-31 12:29:11 +00:00
{
2017-09-27 11:35:42 +08:00
struct net * net = container_of ( table - > data , struct net ,
ipv4 . sysctl_tcp_fastopen ) ;
2019-05-29 12:33:59 -04:00
/* maxlen to print the list of keys in hex (*2), with dashes
* separating doublewords and a comma in between keys .
*/
struct ctl_table tbl = { . maxlen = ( ( TCP_FASTOPEN_KEY_LENGTH *
2 * TCP_FASTOPEN_KEY_MAX ) +
( TCP_FASTOPEN_KEY_MAX * 5 ) ) } ;
2020-08-10 13:38:39 -04:00
u32 user_key [ TCP_FASTOPEN_KEY_BUF_LENGTH / sizeof ( u32 ) ] ;
__le32 key [ TCP_FASTOPEN_KEY_BUF_LENGTH / sizeof ( __le32 ) ] ;
2019-05-29 12:33:59 -04:00
char * backup_data ;
2020-08-10 13:38:39 -04:00
int ret , i = 0 , off = 0 , n_keys ;
2012-08-31 12:29:11 +00:00
tbl . data = kmalloc ( tbl . maxlen , GFP_KERNEL ) ;
if ( ! tbl . data )
return - ENOMEM ;
2020-08-10 13:38:39 -04:00
n_keys = tcp_fastopen_get_cipher ( net , NULL , ( u64 * ) key ) ;
2019-05-29 12:33:59 -04:00
if ( ! n_keys ) {
memset ( & key [ 0 ] , 0 , TCP_FASTOPEN_KEY_LENGTH ) ;
n_keys = 1 ;
}
for ( i = 0 ; i < n_keys * 4 ; i + + )
2018-06-27 16:04:48 -07:00
user_key [ i ] = le32_to_cpu ( key [ i ] ) ;
2019-05-29 12:33:59 -04:00
for ( i = 0 ; i < n_keys ; i + + ) {
off + = snprintf ( tbl . data + off , tbl . maxlen - off ,
" %08x-%08x-%08x-%08x " ,
user_key [ i * 4 ] ,
user_key [ i * 4 + 1 ] ,
user_key [ i * 4 + 2 ] ,
user_key [ i * 4 + 3 ] ) ;
2019-11-20 16:38:08 +08:00
if ( WARN_ON_ONCE ( off > = tbl . maxlen - 1 ) )
break ;
2019-05-29 12:33:59 -04:00
if ( i + 1 < n_keys )
off + = snprintf ( tbl . data + off , tbl . maxlen - off , " , " ) ;
}
2012-08-31 12:29:11 +00:00
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
if ( write & & ret = = 0 ) {
2019-05-29 12:33:59 -04:00
backup_data = strchr ( tbl . data , ' , ' ) ;
if ( backup_data ) {
* backup_data = ' \0 ' ;
backup_data + + ;
}
if ( sscanf_key ( tbl . data , key ) ) {
2012-08-31 12:29:11 +00:00
ret = - EINVAL ;
goto bad_key ;
}
2019-05-29 12:33:59 -04:00
if ( backup_data ) {
if ( sscanf_key ( backup_data , key + 4 ) ) {
ret = - EINVAL ;
goto bad_key ;
}
}
tcp_fastopen_reset_cipher ( net , NULL , key ,
2019-06-19 23:46:28 +02:00
backup_data ? key + 4 : NULL ) ;
2012-08-31 12:29:11 +00:00
}
bad_key :
kfree ( tbl . data ) ;
return ret ;
}
net/tcp_fastopen: Disable active side TFO in certain scenarios
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 14:45:46 -07:00
static int proc_tfo_blackhole_detect_timeout ( struct ctl_table * table ,
2020-04-24 08:43:38 +02:00
int write , void * buffer ,
net/tcp_fastopen: Disable active side TFO in certain scenarios
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 14:45:46 -07:00
size_t * lenp , loff_t * ppos )
{
2017-09-27 11:35:43 +08:00
struct net * net = container_of ( table - > data , struct net ,
ipv4 . sysctl_tcp_fastopen_blackhole_timeout ) ;
net/tcp_fastopen: Disable active side TFO in certain scenarios
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 14:45:46 -07:00
int ret ;
ret = proc_dointvec_minmax ( table , write , buffer , lenp , ppos ) ;
if ( write & & ret = = 0 )
2017-09-27 11:35:43 +08:00
atomic_set ( & net - > ipv4 . tfo_active_disable_times , 0 ) ;
2017-06-14 11:37:14 -07:00
return ret ;
}
static int proc_tcp_available_ulp ( struct ctl_table * ctl ,
2020-04-24 08:43:38 +02:00
int write , void * buffer , size_t * lenp ,
2017-06-14 11:37:14 -07:00
loff_t * ppos )
{
struct ctl_table tbl = { . maxlen = TCP_ULP_BUF_MAX , } ;
int ret ;
tbl . data = kmalloc ( tbl . maxlen , GFP_USER ) ;
if ( ! tbl . data )
return - ENOMEM ;
tcp_get_available_ulp ( tbl . data , TCP_ULP_BUF_MAX ) ;
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
kfree ( tbl . data ) ;
net/tcp_fastopen: Disable active side TFO in certain scenarios
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 14:45:46 -07:00
return ret ;
}
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.
thash_entries sockets length Gbps
524288 1 1 50.7
24Mi 48 45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries sockets length Gbps
131072 1 1 50.7
1Mi 8 49.9
2Mi 16 48.9
4Mi 32 47.3
8Mi 64 44.6
16Mi 128 40.6
24Mi 192 36.3
32Mi 256 32.5
40Mi 320 27.0
48Mi 384 25.0
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-07 18:10:22 -07:00
static int proc_tcp_ehash_entries ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp , loff_t * ppos )
{
struct net * net = container_of ( table - > data , struct net ,
ipv4 . sysctl_tcp_child_ehash_entries ) ;
struct inet_hashinfo * hinfo = net - > ipv4 . tcp_death_row . hashinfo ;
int tcp_ehash_entries ;
struct ctl_table tbl ;
tcp_ehash_entries = hinfo - > ehash_mask + 1 ;
/* A negative number indicates that the child netns
* shares the global ehash .
*/
if ( ! net_eq ( net , & init_net ) & & ! hinfo - > pernet )
tcp_ehash_entries * = - 1 ;
udp: Introduce optional per-netns hash table.
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries sockets length Gbps
64K 1 1 5.69
1Mi 16 5.27
2Mi 32 4.90
4Mi 64 4.09
8Mi 128 2.96
16Mi 256 2.06
32Mi 512 1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
The sysctl usage is the same with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 13:57:57 -08:00
memset ( & tbl , 0 , sizeof ( tbl ) ) ;
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.
thash_entries sockets length Gbps
524288 1 1 50.7
24Mi 48 45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries sockets length Gbps
131072 1 1 50.7
1Mi 8 49.9
2Mi 16 48.9
4Mi 32 47.3
8Mi 64 44.6
16Mi 128 40.6
24Mi 192 36.3
32Mi 256 32.5
40Mi 320 27.0
48Mi 384 25.0
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-07 18:10:22 -07:00
tbl . data = & tcp_ehash_entries ;
tbl . maxlen = sizeof ( int ) ;
return proc_dointvec ( & tbl , write , buffer , lenp , ppos ) ;
}
udp: Introduce optional per-netns hash table.
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries sockets length Gbps
64K 1 1 5.69
1Mi 16 5.27
2Mi 32 4.90
4Mi 64 4.09
8Mi 128 2.96
16Mi 256 2.06
32Mi 512 1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
The sysctl usage is the same with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 13:57:57 -08:00
static int proc_udp_hash_entries ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp , loff_t * ppos )
{
struct net * net = container_of ( table - > data , struct net ,
ipv4 . sysctl_udp_child_hash_entries ) ;
int udp_hash_entries ;
struct ctl_table tbl ;
udp_hash_entries = net - > ipv4 . udp_table - > mask + 1 ;
/* A negative number indicates that the child netns
* shares the global udp_table .
*/
if ( ! net_eq ( net , & init_net ) & & net - > ipv4 . udp_table = = & udp_table )
udp_hash_entries * = - 1 ;
memset ( & tbl , 0 , sizeof ( tbl ) ) ;
tbl . data = & udp_hash_entries ;
tbl . maxlen = sizeof ( int ) ;
return proc_dointvec ( & tbl , write , buffer , lenp , ppos ) ;
}
2017-11-02 17:14:05 +01:00
# ifdef CONFIG_IP_ROUTE_MULTIPATH
static int proc_fib_multipath_hash_policy ( struct ctl_table * table , int write ,
2020-04-24 08:43:38 +02:00
void * buffer , size_t * lenp ,
2017-11-02 17:14:05 +01:00
loff_t * ppos )
{
struct net * net = container_of ( table - > data , struct net ,
ipv4 . sysctl_fib_multipath_hash_policy ) ;
int ret ;
2021-03-31 10:52:09 -07:00
ret = proc_dou8vec_minmax ( table , write , buffer , lenp , ppos ) ;
2017-11-02 17:14:05 +01:00
if ( write & & ret = = 0 )
2018-03-02 08:32:16 -08:00
call_netevent_notifiers ( NETEVENT_IPV4_MPATH_HASH_UPDATE , net ) ;
2017-11-02 17:14:05 +01:00
return ret ;
}
2021-05-19 15:08:18 +03:00
static int proc_fib_multipath_hash_fields ( struct ctl_table * table , int write ,
void * buffer , size_t * lenp ,
loff_t * ppos )
{
struct net * net ;
int ret ;
net = container_of ( table - > data , struct net ,
ipv4 . sysctl_fib_multipath_hash_fields ) ;
ret = proc_douintvec_minmax ( table , write , buffer , lenp , ppos ) ;
if ( write & & ret = = 0 )
call_netevent_notifiers ( NETEVENT_IPV4_MPATH_HASH_UPDATE , net ) ;
return ret ;
}
2017-11-02 17:14:05 +01:00
# endif
2007-12-05 01:41:26 -08:00
static struct ctl_table ipv4_table [ ] = {
2005-04-16 15:20:36 -07:00
{
. procname = " tcp_max_orphans " ,
. data = & sysctl_tcp_max_orphans ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " inet_peer_threshold " ,
. data = & inet_peer_threshold ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " inet_peer_minttl " ,
. data = & inet_peer_minttl ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_jiffies ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " inet_peer_maxttl " ,
. data = & inet_peer_maxttl ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_jiffies ,
2005-04-16 15:20:36 -07:00
} ,
2013-10-19 16:25:36 -07:00
{
. procname = " tcp_mem " ,
. maxlen = sizeof ( sysctl_tcp_mem ) ,
. data = & sysctl_tcp_mem ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
2005-04-16 15:20:36 -07:00
{
. procname = " tcp_low_latency " ,
. data = & sysctl_tcp_low_latency ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2006-08-03 16:48:06 -07:00
# ifdef CONFIG_NETLABEL
{
. procname = " cipso_cache_enable " ,
. data = & cipso_v4_cache_enabled ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
{
. procname = " cipso_cache_bucket_size " ,
. data = & cipso_v4_cache_bucketsize ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
{
. procname = " cipso_rbm_optfmt " ,
. data = & cipso_v4_rbm_optfmt ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
{
. procname = " cipso_rbm_strictvalid " ,
. data = & cipso_v4_rbm_strictvalid ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
# endif /* CONFIG_NETLABEL */
2017-06-14 11:37:14 -07:00
{
. procname = " tcp_available_ulp " ,
. maxlen = TCP_ULP_BUF_MAX ,
. mode = 0444 ,
. proc_handler = proc_tcp_available_ulp ,
} ,
2014-09-19 07:38:40 -07:00
{
. procname = " icmp_msgs_per_sec " ,
. data = & sysctl_icmp_msgs_per_sec ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2014-09-19 07:38:40 -07:00
} ,
{
. procname = " icmp_msgs_burst " ,
. data = & sysctl_icmp_msgs_burst ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2014-09-19 07:38:40 -07:00
} ,
2007-12-31 00:29:24 -08:00
{
. procname = " udp_mem " ,
. data = & sysctl_udp_mem ,
. maxlen = sizeof ( sysctl_udp_mem ) ,
. mode = 0644 ,
2010-11-09 23:24:26 +00:00
. proc_handler = proc_doulongvec_minmax ,
2007-12-31 00:29:24 -08:00
} ,
2019-03-20 09:18:59 -07:00
{
. procname = " fib_sync_mem " ,
. data = & sysctl_fib_sync_mem ,
. maxlen = sizeof ( sysctl_fib_sync_mem ) ,
. mode = 0644 ,
. proc_handler = proc_douintvec_minmax ,
. extra1 = & sysctl_fib_sync_mem_min ,
. extra2 = & sysctl_fib_sync_mem_max ,
} ,
2005-04-16 15:20:36 -07:00
} ;
2007-12-05 01:41:26 -08:00
2008-03-26 01:56:24 -07:00
static struct ctl_table ipv4_net_table [ ] = {
2022-01-26 10:07:14 -08:00
{
. procname = " tcp_max_tw_buckets " ,
2022-09-07 18:10:18 -07:00
. data = & init_net . ipv4 . tcp_death_row . sysctl_max_tw_buckets ,
2022-01-26 10:07:14 -08:00
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2008-03-26 01:56:24 -07:00
{
. procname = " icmp_echo_ignore_all " ,
. data = & init_net . ipv4 . sysctl_icmp_echo_ignore_all ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2008-03-26 01:56:24 -07:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2022-07-11 17:15:22 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE
2008-03-26 01:56:24 -07:00
} ,
2021-03-29 18:45:29 -07:00
{
. procname = " icmp_echo_enable_probe " ,
. data = & init_net . ipv4 . sysctl_icmp_echo_enable_probe ,
2021-03-30 14:06:13 -07:00
. maxlen = sizeof ( u8 ) ,
2021-03-29 18:45:29 -07:00
. mode = 0644 ,
2021-03-30 14:06:13 -07:00
. proc_handler = proc_dou8vec_minmax ,
2021-03-29 18:45:29 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE
} ,
2008-03-26 01:56:24 -07:00
{
. procname = " icmp_echo_ignore_broadcasts " ,
. data = & init_net . ipv4 . sysctl_icmp_echo_ignore_broadcasts ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2008-03-26 01:56:24 -07:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2022-07-11 17:15:24 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_ignore_bogus_error_responses " ,
. data = & init_net . ipv4 . sysctl_icmp_ignore_bogus_error_responses ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2008-03-26 01:56:24 -07:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2022-07-11 17:15:25 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_errors_use_inbound_ifaddr " ,
. data = & init_net . ipv4 . sysctl_icmp_errors_use_inbound_ifaddr ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2008-03-26 01:56:24 -07:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2022-07-11 17:15:26 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_ratelimit " ,
. data = & init_net . ipv4 . sysctl_icmp_ratelimit ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_ms_jiffies ,
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_ratemask " ,
. data = & init_net . ipv4 . sysctl_icmp_ratemask ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2008-03-26 01:56:24 -07:00
} ,
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
{
. procname = " ping_group_range " ,
2014-05-06 11:02:50 -07:00
. data = & init_net . ipv4 . ping_group_range . range ,
2012-05-24 10:34:21 -06:00
. maxlen = sizeof ( gid_t ) * 2 ,
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
. mode = 0644 ,
. proc_handler = ipv4_ping_group_range ,
} ,
2018-11-07 15:36:05 +00:00
# ifdef CONFIG_NET_L3_MASTER_DEV
{
. procname = " raw_l3mdev_accept " ,
. data = & init_net . ipv4 . sysctl_raw_l3mdev_accept ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2018-11-07 15:36:05 +00:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
2018-11-07 15:36:05 +00:00
} ,
# endif
2013-01-05 16:10:48 +00:00
{
. procname = " tcp_ecn " ,
. data = & init_net . ipv4 . sysctl_tcp_ecn ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2013-01-05 16:10:48 +00:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2022-07-11 17:15:30 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_TWO ,
2013-01-05 16:10:48 +00:00
} ,
tcp: add rfc3168, section 6.1.1.1. fallback
This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 21:04:22 +02:00
{
. procname = " tcp_ecn_fallback " ,
. data = & init_net . ipv4 . sysctl_tcp_ecn_fallback ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
tcp: add rfc3168, section 6.1.1.1. fallback
This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 21:04:22 +02:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2022-07-11 17:15:31 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
tcp: add rfc3168, section 6.1.1.1. fallback
This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 21:04:22 +02:00
} ,
2016-02-15 12:11:29 +02:00
{
. procname = " ip_dynaddr " ,
. data = & init_net . ipv4 . sysctl_ip_dynaddr ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-15 12:11:29 +02:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-15 12:11:29 +02:00
} ,
2016-02-15 12:11:30 +02:00
{
. procname = " ip_early_demux " ,
. data = & init_net . ipv4 . sysctl_ip_early_demux ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-15 12:11:30 +02:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-15 12:11:30 +02:00
} ,
2017-03-23 13:34:16 -06:00
{
. procname = " udp_early_demux " ,
. data = & init_net . ipv4 . sysctl_udp_early_demux ,
2021-03-25 11:08:16 -07:00
. maxlen = sizeof ( u8 ) ,
2017-03-23 13:34:16 -06:00
. mode = 0644 ,
tcp/udp: Make early_demux back namespacified.
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis. Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for
tcp and udp"). However, the .proc_handler() was wrong and actually
disabled us from changing the behaviour in each netns.
We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic. Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.
This patch namespacifies (tcp|udp)_early_demux again. For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline. So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.
Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-13 10:52:07 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-03-23 13:34:16 -06:00
} ,
{
. procname = " tcp_early_demux " ,
. data = & init_net . ipv4 . sysctl_tcp_early_demux ,
2021-03-25 11:08:16 -07:00
. maxlen = sizeof ( u8 ) ,
2017-03-23 13:34:16 -06:00
. mode = 0644 ,
tcp/udp: Make early_demux back namespacified.
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis. Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for
tcp and udp"). However, the .proc_handler() was wrong and actually
disabled us from changing the behaviour in each netns.
We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic. Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.
This patch namespacifies (tcp|udp)_early_demux again. For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline. So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.
Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-13 10:52:07 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-03-23 13:34:16 -06:00
} ,
2020-04-27 13:56:46 -07:00
{
. procname = " nexthop_compat_mode " ,
. data = & init_net . ipv4 . sysctl_nexthop_compat_mode ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2020-04-27 13:56:46 -07:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2020-04-27 13:56:46 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
2016-02-15 12:11:27 +02:00
{
. procname = " ip_default_ttl " ,
. data = & init_net . ipv4 . sysctl_ip_default_ttl ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-15 12:11:27 +02:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-15 12:11:27 +02:00
. extra1 = & ip_ttl_min ,
. extra2 = & ip_ttl_max ,
} ,
2013-09-28 14:10:59 -07:00
{
. procname = " ip_local_port_range " ,
2023-12-06 13:44:20 +00:00
. maxlen = 0 ,
. data = & init_net ,
2013-09-28 14:10:59 -07:00
. mode = 0644 ,
. proc_handler = ipv4_local_port_range ,
} ,
2014-05-12 16:04:53 -07:00
{
. procname = " ip_local_reserved_ports " ,
. data = & init_net . ipv4 . sysctl_local_reserved_ports ,
. maxlen = 65536 ,
. mode = 0644 ,
. proc_handler = proc_do_large_bitmap ,
} ,
2013-12-14 05:13:38 +01:00
{
. procname = " ip_no_pmtu_disc " ,
. data = & init_net . ipv4 . sysctl_ip_no_pmtu_disc ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2013-12-14 05:13:38 +01:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2013-12-14 05:13:38 +01:00
} ,
2014-01-09 10:01:15 +01:00
{
. procname = " ip_forward_use_pmtu " ,
. data = & init_net . ipv4 . sysctl_ip_fwd_use_pmtu ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2014-01-09 10:01:15 +01:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2014-01-09 10:01:15 +01:00
} ,
2018-08-01 00:36:03 +02:00
{
. procname = " ip_forward_update_priority " ,
. data = & init_net . ipv4 . sysctl_ip_fwd_update_priority ,
2021-03-25 11:08:15 -07:00
. maxlen = sizeof ( u8 ) ,
2018-08-01 00:36:03 +02:00
. mode = 0644 ,
2018-08-01 00:36:42 +02:00
. proc_handler = ipv4_fwd_update_priority ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
2018-08-01 00:36:03 +02:00
} ,
2014-09-05 15:09:03 +02:00
{
. procname = " ip_nonlocal_bind " ,
. data = & init_net . ipv4 . sysctl_ip_nonlocal_bind ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2014-09-05 15:09:03 +02:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2014-09-05 15:09:03 +02:00
} ,
tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.
The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.
You can reproduce this issue by following instructions below.
1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
2. set SO_REUSEADDR to two sockets.
3. bind two sockets to (localhost, 0) and the latter fails.
Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.
This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.
1. setsockopt(sk1, SO_REUSEADDR)
2. setsockopt(sk2, SO_REUSEADDR)
3. bind(sk1, saddr, 0)
4. bind(sk2, saddr, 0)
5. connect(sk1, daddr)
6. listen(sk2)
If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.
The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 17:05:25 +09:00
{
. procname = " ip_autobind_reuse " ,
. data = & init_net . ipv4 . sysctl_ip_autobind_reuse ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.
The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.
You can reproduce this issue by following instructions below.
1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
2. set SO_REUSEADDR to two sockets.
3. bind two sockets to (localhost, 0) and the latter fails.
Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.
This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.
1. setsockopt(sk1, SO_REUSEADDR)
2. setsockopt(sk2, SO_REUSEADDR)
3. bind(sk1, saddr, 0)
4. bind(sk2, saddr, 0)
5. connect(sk1, daddr)
6. listen(sk2)
If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.
The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 17:05:25 +09:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.
The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.
You can reproduce this issue by following instructions below.
1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
2. set SO_REUSEADDR to two sockets.
3. bind two sockets to (localhost, 0) and the latter fails.
Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.
This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.
1. setsockopt(sk1, SO_REUSEADDR)
2. setsockopt(sk2, SO_REUSEADDR)
3. bind(sk1, saddr, 0)
4. bind(sk2, saddr, 0)
5. connect(sk1, daddr)
6. listen(sk2)
If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.
The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 17:05:25 +09:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
2014-05-13 10:17:33 -07:00
{
. procname = " fwmark_reflect " ,
. data = & init_net . ipv4 . sysctl_fwmark_reflect ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. maxlen = sizeof ( u8 ) ,
2014-05-13 10:17:33 -07:00
. mode = 0644 ,
ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 11:08:14 -07:00
. proc_handler = proc_dou8vec_minmax ,
2014-05-13 10:17:33 -07:00
} ,
net: support marking accepting TCP sockets
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.
This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.
This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established. If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
Black-box tested using user-mode linux:
- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 10:17:35 -07:00
{
. procname = " tcp_fwmark_accept " ,
. data = & init_net . ipv4 . sysctl_tcp_fwmark_accept ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
net: support marking accepting TCP sockets
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.
This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.
This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established. If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
Black-box tested using user-mode linux:
- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 10:17:35 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
net: support marking accepting TCP sockets
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.
This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.
This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established. If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
Black-box tested using user-mode linux:
- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 10:17:35 -07:00
} ,
2015-12-16 13:20:44 -08:00
# ifdef CONFIG_NET_L3_MASTER_DEV
{
. procname = " tcp_l3mdev_accept " ,
. data = & init_net . ipv4 . sysctl_tcp_l3mdev_accept ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2015-12-16 13:20:44 -08:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
2015-12-16 13:20:44 -08:00
} ,
# endif
2015-02-10 09:53:16 +08:00
{
. procname = " tcp_mtu_probing " ,
. data = & init_net . ipv4 . sysctl_tcp_mtu_probing ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2015-02-10 09:53:16 +08:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2015-02-10 09:53:16 +08:00
} ,
{
. procname = " tcp_base_mss " ,
. data = & init_net . ipv4 . sysctl_tcp_base_mss ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2019-06-06 09:15:31 -07:00
{
. procname = " tcp_min_snd_mss " ,
. data = & init_net . ipv4 . sysctl_tcp_min_snd_mss ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & tcp_min_snd_mss_min ,
. extra2 = & tcp_min_snd_mss_max ,
} ,
2019-08-07 19:52:29 -04:00
{
. procname = " tcp_mtu_probe_floor " ,
. data = & init_net . ipv4 . sysctl_tcp_mtu_probe_floor ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & tcp_min_snd_mss_min ,
. extra2 = & tcp_min_snd_mss_max ,
} ,
2015-03-06 11:18:23 +08:00
{
. procname = " tcp_probe_threshold " ,
. data = & init_net . ipv4 . sysctl_tcp_probe_threshold ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2015-03-06 11:18:24 +08:00
{
. procname = " tcp_probe_interval " ,
. data = & init_net . ipv4 . sysctl_tcp_probe_interval ,
2018-09-25 21:59:28 -07:00
. maxlen = sizeof ( u32 ) ,
2015-03-06 11:18:24 +08:00
. mode = 0644 ,
2018-09-25 21:59:28 -07:00
. proc_handler = proc_douintvec_minmax ,
. extra2 = & u32_max_div_HZ ,
2015-03-06 11:18:24 +08:00
} ,
2015-08-27 16:46:26 +01:00
{
. procname = " igmp_link_local_mcast_reports " ,
2016-02-09 00:13:50 +02:00
. data = & init_net . ipv4 . sysctl_igmp_llm_reports ,
2021-03-31 10:52:10 -07:00
. maxlen = sizeof ( u8 ) ,
2015-08-27 16:46:26 +01:00
. mode = 0644 ,
2021-03-31 10:52:10 -07:00
. proc_handler = proc_dou8vec_minmax ,
2015-08-27 16:46:26 +01:00
} ,
2016-02-08 23:29:21 +02:00
{
. procname = " igmp_max_memberships " ,
. data = & init_net . ipv4 . sysctl_igmp_max_memberships ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2016-02-08 23:29:22 +02:00
{
. procname = " igmp_max_msf " ,
. data = & init_net . ipv4 . sysctl_igmp_max_msf ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2016-02-08 23:29:24 +02:00
# ifdef CONFIG_IP_MULTICAST
{
. procname = " igmp_qrv " ,
. data = & init_net . ipv4 . sysctl_igmp_qrv ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ONE
2016-02-08 23:29:24 +02:00
} ,
# endif
2017-11-14 08:25:49 -08:00
{
. procname = " tcp_congestion_control " ,
. data = & init_net . ipv4 . tcp_congestion_control ,
. mode = 0644 ,
. maxlen = TCP_CA_NAME_MAX ,
. proc_handler = proc_tcp_congestion_control ,
} ,
2020-02-19 13:02:53 +01:00
{
. procname = " tcp_available_congestion_control " ,
. maxlen = TCP_CA_BUF_MAX ,
. mode = 0444 ,
. proc_handler = proc_tcp_available_congestion_control ,
} ,
{
. procname = " tcp_allowed_congestion_control " ,
. maxlen = TCP_CA_BUF_MAX ,
. mode = 0644 ,
. proc_handler = proc_allowed_congestion_control ,
} ,
2016-01-07 16:38:43 +02:00
{
. procname = " tcp_keepalive_time " ,
. data = & init_net . ipv4 . sysctl_tcp_keepalive_time ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_jiffies ,
} ,
2016-01-07 16:38:44 +02:00
{
. procname = " tcp_keepalive_probes " ,
. data = & init_net . ipv4 . sysctl_tcp_keepalive_probes ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-01-07 16:38:44 +02:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-01-07 16:38:44 +02:00
} ,
2016-01-07 16:38:45 +02:00
{
. procname = " tcp_keepalive_intvl " ,
. data = & init_net . ipv4 . sysctl_tcp_keepalive_intvl ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_jiffies ,
} ,
2016-02-03 09:46:49 +02:00
{
. procname = " tcp_syn_retries " ,
. data = & init_net . ipv4 . sysctl_tcp_syn_retries ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-03 09:46:49 +02:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-03 09:46:49 +02:00
. extra1 = & tcp_syn_retries_min ,
. extra2 = & tcp_syn_retries_max
} ,
2016-02-03 09:46:50 +02:00
{
. procname = " tcp_synack_retries " ,
. data = & init_net . ipv4 . sysctl_tcp_synack_retries ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-03 09:46:50 +02:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-03 09:46:50 +02:00
} ,
2016-02-03 09:46:51 +02:00
# ifdef CONFIG_SYN_COOKIES
{
. procname = " tcp_syncookies " ,
. data = & init_net . ipv4 . sysctl_tcp_syncookies ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-03 09:46:51 +02:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-03 09:46:51 +02:00
} ,
# endif
2021-06-12 21:32:14 +09:00
{
. procname = " tcp_migrate_req " ,
. data = & init_net . ipv4 . sysctl_tcp_migrate_req ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE
} ,
2016-02-03 09:46:52 +02:00
{
. procname = " tcp_reordering " ,
. data = & init_net . ipv4 . sysctl_tcp_reordering ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2016-02-03 09:46:53 +02:00
{
. procname = " tcp_retries1 " ,
. data = & init_net . ipv4 . sysctl_tcp_retries1 ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-03 09:46:53 +02:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-03 09:46:53 +02:00
. extra2 = & tcp_retr1_max
} ,
2016-02-03 09:46:54 +02:00
{
. procname = " tcp_retries2 " ,
. data = & init_net . ipv4 . sysctl_tcp_retries2 ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-03 09:46:54 +02:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-03 09:46:54 +02:00
} ,
2016-02-03 09:46:55 +02:00
{
. procname = " tcp_orphan_retries " ,
. data = & init_net . ipv4 . sysctl_tcp_orphan_retries ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-02-03 09:46:55 +02:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2016-02-03 09:46:55 +02:00
} ,
2016-02-03 09:46:56 +02:00
{
. procname = " tcp_fin_timeout " ,
. data = & init_net . ipv4 . sysctl_tcp_fin_timeout ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_jiffies ,
} ,
2016-02-03 09:46:57 +02:00
{
. procname = " tcp_notsent_lowat " ,
. data = & init_net . ipv4 . sysctl_tcp_notsent_lowat ,
. maxlen = sizeof ( unsigned int ) ,
. mode = 0644 ,
2017-01-09 10:45:49 +03:00
. proc_handler = proc_douintvec ,
2016-02-03 09:46:57 +02:00
} ,
2016-12-25 14:33:16 +08:00
{
. procname = " tcp_tw_reuse " ,
. data = & init_net . ipv4 . sysctl_tcp_tw_reuse ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2016-12-25 14:33:16 +08:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2022-05-01 11:55:22 +08:00
. extra2 = SYSCTL_TWO ,
2016-12-25 14:33:16 +08:00
} ,
2016-12-28 17:52:33 +08:00
{
. procname = " tcp_max_syn_backlog " ,
. data = & init_net . ipv4 . sysctl_max_syn_backlog ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2017-09-27 11:35:40 +08:00
{
. procname = " tcp_fastopen " ,
. data = & init_net . ipv4 . sysctl_tcp_fastopen ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2017-09-27 11:35:42 +08:00
{
. procname = " tcp_fastopen_key " ,
. mode = 0600 ,
. data = & init_net . ipv4 . sysctl_tcp_fastopen ,
2019-05-29 12:33:59 -04:00
/* maxlen to print the list of keys in hex (*2), with dashes
* separating doublewords and a comma in between keys .
*/
. maxlen = ( ( TCP_FASTOPEN_KEY_LENGTH *
2 * TCP_FASTOPEN_KEY_MAX ) +
( TCP_FASTOPEN_KEY_MAX * 5 ) ) ,
2017-09-27 11:35:42 +08:00
. proc_handler = proc_tcp_fastopen_key ,
} ,
2017-09-27 11:35:43 +08:00
{
. procname = " tcp_fastopen_blackhole_timeout_sec " ,
. data = & init_net . ipv4 . sysctl_tcp_fastopen_blackhole_timeout ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_tfo_blackhole_detect_timeout ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2017-09-27 11:35:43 +08:00
} ,
2016-04-07 07:21:00 -07:00
# ifdef CONFIG_IP_ROUTE_MULTIPATH
{
. procname = " fib_multipath_use_neigh " ,
. data = & init_net . ipv4 . sysctl_fib_multipath_use_neigh ,
2021-03-31 10:52:09 -07:00
. maxlen = sizeof ( u8 ) ,
2016-04-07 07:21:00 -07:00
. mode = 0644 ,
2021-03-31 10:52:09 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
2017-03-16 15:28:00 +02:00
} ,
{
. procname = " fib_multipath_hash_policy " ,
. data = & init_net . ipv4 . sysctl_fib_multipath_hash_policy ,
2021-03-31 10:52:09 -07:00
. maxlen = sizeof ( u8 ) ,
2017-03-16 15:28:00 +02:00
. mode = 0644 ,
2017-11-02 17:14:05 +01:00
. proc_handler = proc_fib_multipath_hash_policy ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2022-05-01 11:55:23 +08:00
. extra2 = SYSCTL_THREE ,
2016-04-07 07:21:00 -07:00
} ,
2021-05-17 21:15:18 +03:00
{
. procname = " fib_multipath_hash_fields " ,
. data = & init_net . ipv4 . sysctl_fib_multipath_hash_fields ,
. maxlen = sizeof ( u32 ) ,
. mode = 0644 ,
2021-05-19 15:08:18 +03:00
. proc_handler = proc_fib_multipath_hash_fields ,
2021-05-17 21:15:18 +03:00
. extra1 = SYSCTL_ONE ,
. extra2 = & fib_multipath_hash_fields_all_mask ,
} ,
2016-04-07 07:21:00 -07:00
# endif
2017-01-20 17:49:11 -08:00
{
. procname = " ip_unprivileged_port_start " ,
. maxlen = sizeof ( int ) ,
. data = & init_net . ipv4 . sysctl_ip_prot_sock ,
. mode = 0644 ,
. proc_handler = ipv4_privileged_ports ,
} ,
2017-01-26 18:02:24 +00:00
# ifdef CONFIG_NET_L3_MASTER_DEV
{
. procname = " udp_l3mdev_accept " ,
. data = & init_net . ipv4 . sysctl_udp_l3mdev_accept ,
2021-03-31 10:52:08 -07:00
. maxlen = sizeof ( u8 ) ,
2017-01-26 18:02:24 +00:00
. mode = 0644 ,
2021-03-31 10:52:08 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
2017-01-26 18:02:24 +00:00
} ,
# endif
2017-06-07 10:34:37 -07:00
{
. procname = " tcp_sack " ,
. data = & init_net . ipv4 . sysctl_tcp_sack ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-06-07 10:34:37 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-06-07 10:34:37 -07:00
} ,
2017-06-07 10:34:38 -07:00
{
. procname = " tcp_window_scaling " ,
. data = & init_net . ipv4 . sysctl_tcp_window_scaling ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-06-07 10:34:38 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-06-07 10:34:38 -07:00
} ,
2017-06-07 10:34:39 -07:00
{
. procname = " tcp_timestamps " ,
. data = & init_net . ipv4 . sysctl_tcp_timestamps ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-06-07 10:34:39 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-06-07 10:34:39 -07:00
} ,
2017-10-26 21:54:56 -07:00
{
. procname = " tcp_early_retrans " ,
. data = & init_net . ipv4 . sysctl_tcp_early_retrans ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:54:56 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2022-05-01 11:55:22 +08:00
. extra2 = SYSCTL_FOUR ,
2017-10-26 21:54:56 -07:00
} ,
2017-10-26 21:54:57 -07:00
{
. procname = " tcp_recovery " ,
. data = & init_net . ipv4 . sysctl_tcp_recovery ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:54:57 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:54:57 -07:00
} ,
2017-10-26 21:54:58 -07:00
{
. procname = " tcp_thin_linear_timeouts " ,
. data = & init_net . ipv4 . sysctl_tcp_thin_linear_timeouts ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:54:58 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:54:58 -07:00
} ,
2017-10-26 21:54:59 -07:00
{
. procname = " tcp_slow_start_after_idle " ,
. data = & init_net . ipv4 . sysctl_tcp_slow_start_after_idle ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:54:59 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:54:59 -07:00
} ,
2017-10-26 21:55:00 -07:00
{
. procname = " tcp_retrans_collapse " ,
. data = & init_net . ipv4 . sysctl_tcp_retrans_collapse ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:00 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:55:00 -07:00
} ,
2017-10-26 21:55:01 -07:00
{
. procname = " tcp_stdurg " ,
. data = & init_net . ipv4 . sysctl_tcp_stdurg ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:01 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:55:01 -07:00
} ,
2017-10-26 21:55:02 -07:00
{
. procname = " tcp_rfc1337 " ,
. data = & init_net . ipv4 . sysctl_tcp_rfc1337 ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:02 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:55:02 -07:00
} ,
2017-10-26 21:55:03 -07:00
{
. procname = " tcp_abort_on_overflow " ,
. data = & init_net . ipv4 . sysctl_tcp_abort_on_overflow ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:03 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:55:03 -07:00
} ,
2017-10-26 21:55:04 -07:00
{
. procname = " tcp_fack " ,
. data = & init_net . ipv4 . sysctl_tcp_fack ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:04 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:55:04 -07:00
} ,
2017-10-26 21:55:06 -07:00
{
. procname = " tcp_max_reordering " ,
. data = & init_net . ipv4 . sysctl_tcp_max_reordering ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2017-10-26 21:55:07 -07:00
{
. procname = " tcp_dsack " ,
. data = & init_net . ipv4 . sysctl_tcp_dsack ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:07 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:55:07 -07:00
} ,
2017-10-26 21:55:08 -07:00
{
. procname = " tcp_app_win " ,
. data = & init_net . ipv4 . sysctl_tcp_app_win ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:08 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2023-04-06 14:34:50 +08:00
. extra1 = SYSCTL_ZERO ,
. extra2 = & tcp_app_win_max ,
2017-10-26 21:55:08 -07:00
} ,
2017-10-26 21:55:09 -07:00
{
. procname = " tcp_adv_win_scale " ,
. data = & init_net . ipv4 . sysctl_tcp_adv_win_scale ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & tcp_adv_win_scale_min ,
. extra2 = & tcp_adv_win_scale_max ,
} ,
2017-10-26 21:55:10 -07:00
{
. procname = " tcp_frto " ,
. data = & init_net . ipv4 . sysctl_tcp_frto ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-26 21:55:10 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-26 21:55:10 -07:00
} ,
2017-10-27 07:47:21 -07:00
{
. procname = " tcp_no_metrics_save " ,
. data = & init_net . ipv4 . sysctl_tcp_nometrics_save ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-27 07:47:21 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-27 07:47:21 -07:00
} ,
2019-12-09 14:19:59 -05:00
{
. procname = " tcp_no_ssthresh_metrics_save " ,
. data = & init_net . ipv4 . sysctl_tcp_no_ssthresh_metrics_save ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2019-12-09 14:19:59 -05:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2019-12-09 14:19:59 -05:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
2017-10-27 07:47:22 -07:00
{
. procname = " tcp_moderate_rcvbuf " ,
. data = & init_net . ipv4 . sysctl_tcp_moderate_rcvbuf ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-27 07:47:22 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-27 07:47:22 -07:00
} ,
2017-10-27 07:47:23 -07:00
{
. procname = " tcp_tso_win_divisor " ,
. data = & init_net . ipv4 . sysctl_tcp_tso_win_divisor ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-27 07:47:23 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-27 07:47:23 -07:00
} ,
2017-10-27 07:47:24 -07:00
{
. procname = " tcp_workaround_signed_windows " ,
. data = & init_net . ipv4 . sysctl_tcp_workaround_signed_windows ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-27 07:47:24 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2017-10-27 07:47:24 -07:00
} ,
2017-10-27 07:47:25 -07:00
{
. procname = " tcp_limit_output_bytes " ,
. data = & init_net . ipv4 . sysctl_tcp_limit_output_bytes ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2017-10-27 07:47:26 -07:00
{
. procname = " tcp_challenge_ack_limit " ,
. data = & init_net . ipv4 . sysctl_tcp_challenge_ack_limit ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2017-10-27 07:47:27 -07:00
{
. procname = " tcp_min_tso_segs " ,
. data = & init_net . ipv4 . sysctl_tcp_min_tso_segs ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-27 07:47:27 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ONE ,
2017-10-27 07:47:27 -07:00
} ,
tcp: adjust TSO packet sizes based on min_rtt
Back when tcp_tso_autosize() and TCP pacing were introduced,
our focus was really to reduce burst sizes for long distance
flows.
The simple heuristic of using sk_pacing_rate/1024 has worked
well, but can lead to too small packets for hosts in the same
rack/cluster, when thousands of flows compete for the bottleneck.
Neal Cardwell had the idea of making the TSO burst size
a function of both sk_pacing_rate and tcp_min_rtt()
Indeed, for local flows, sending bigger bursts is better
to reduce cpu costs, as occasional losses can be repaired
quite fast.
This patch is based on Neal Cardwell implementation
done more than two years ago.
bbr is adjusting max_pacing_rate based on measured bandwidth,
while cubic would over estimate max_pacing_rate.
/proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
this new feature, in logarithmic steps.
Tested:
100Gbit NIC, two hosts in the same rack, 4K MTU.
600 flows rate-limited to 20000000 bytes per second.
Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96005
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
65,945.29 msec task-clock # 2.845 CPUs utilized
1,314,632 context-switches # 19935.279 M/sec
5,292 cpu-migrations # 80.249 M/sec
940,641 page-faults # 14264.023 M/sec
201,117,030,926 cycles # 3049769.216 GHz (83.45%)
17,699,435,405 stalled-cycles-frontend # 8.80% frontend cycles idle (83.48%)
136,584,015,071 stalled-cycles-backend # 67.91% backend cycles idle (83.44%)
53,809,530,436 instructions # 0.27 insn per cycle
# 2.54 stalled cycles per insn (83.36%)
9,062,315,523 branches # 137422329.563 M/sec (83.22%)
153,008,621 branch-misses # 1.69% of all branches (83.32%)
23.182970846 seconds time elapsed
TcpInSegs 15648792 0.0
TcpOutSegs 58659110 0.0 # Average of 3.7 4K segments per TSO packet
TcpExtTCPDelivered 58654791 0.0
TcpExtTCPDeliveredCE 19 0.0
After patch:
~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96046
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
48,982.58 msec task-clock # 2.104 CPUs utilized
186,014 context-switches # 3797.599 M/sec
3,109 cpu-migrations # 63.472 M/sec
941,180 page-faults # 19214.814 M/sec
153,459,763,868 cycles # 3132982.807 GHz (83.56%)
12,069,861,356 stalled-cycles-frontend # 7.87% frontend cycles idle (83.32%)
120,485,917,953 stalled-cycles-backend # 78.51% backend cycles idle (83.24%)
36,803,672,106 instructions # 0.24 insn per cycle
# 3.27 stalled cycles per insn (83.18%)
5,947,266,275 branches # 121417383.427 M/sec (83.64%)
87,984,616 branch-misses # 1.48% of all branches (83.43%)
23.281200256 seconds time elapsed
TcpInSegs 1434706 0.0
TcpOutSegs 58883378 0.0 # Average of 41 4K segments per TSO packet
TcpExtTCPDelivered 58878971 0.0
TcpExtTCPDeliveredCE 9664 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08 17:57:57 -08:00
{
. procname = " tcp_tso_rtt_log " ,
. data = & init_net . ipv4 . sysctl_tcp_tso_rtt_log ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
} ,
2017-10-27 07:47:28 -07:00
{
. procname = " tcp_min_rtt_wlen " ,
. data = & init_net . ipv4 . sysctl_tcp_min_rtt_wlen ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2019-04-16 09:47:24 +08:00
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2019-04-16 09:47:24 +08:00
. extra2 = & one_day_secs
2017-10-27 07:47:28 -07:00
} ,
2017-10-27 07:47:29 -07:00
{
. procname = " tcp_autocorking " ,
. data = & init_net . ipv4 . sysctl_tcp_autocorking ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2017-10-27 07:47:29 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
2017-10-27 07:47:29 -07:00
} ,
2017-10-27 07:47:30 -07:00
{
. procname = " tcp_invalid_ratelimit " ,
. data = & init_net . ipv4 . sysctl_tcp_invalid_ratelimit ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_ms_jiffies ,
} ,
2017-10-27 07:47:31 -07:00
{
. procname = " tcp_pacing_ss_ratio " ,
. data = & init_net . ipv4 . sysctl_tcp_pacing_ss_ratio ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2022-05-01 11:55:22 +08:00
. extra2 = SYSCTL_ONE_THOUSAND ,
2017-10-27 07:47:31 -07:00
} ,
2017-10-27 07:47:32 -07:00
{
. procname = " tcp_pacing_ca_ratio " ,
. data = & init_net . ipv4 . sysctl_tcp_pacing_ca_ratio ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2022-05-01 11:55:22 +08:00
. extra2 = SYSCTL_ONE_THOUSAND ,
2017-10-27 07:47:32 -07:00
} ,
2017-11-07 00:29:28 -08:00
{
. procname = " tcp_wmem " ,
. data = & init_net . ipv4 . sysctl_tcp_wmem ,
. maxlen = sizeof ( init_net . ipv4 . sysctl_tcp_wmem ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ONE ,
2017-11-07 00:29:28 -08:00
} ,
{
. procname = " tcp_rmem " ,
. data = & init_net . ipv4 . sysctl_tcp_rmem ,
. maxlen = sizeof ( init_net . ipv4 . sysctl_tcp_rmem ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ONE ,
2017-11-07 00:29:28 -08:00
} ,
2018-05-17 14:47:28 -07:00
{
. procname = " tcp_comp_sack_delay_ns " ,
. data = & init_net . ipv4 . sysctl_tcp_comp_sack_delay_ns ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
2020-04-30 10:35:43 -07:00
{
. procname = " tcp_comp_sack_slack_ns " ,
. data = & init_net . ipv4 . sysctl_tcp_comp_sack_slack_ns ,
. maxlen = sizeof ( unsigned long ) ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
2018-05-17 14:47:29 -07:00
{
. procname = " tcp_comp_sack_nr " ,
. data = & init_net . ipv4 . sysctl_tcp_comp_sack_nr ,
2021-03-31 10:52:11 -07:00
. maxlen = sizeof ( u8 ) ,
2018-05-17 14:47:29 -07:00
. mode = 0644 ,
2021-03-31 10:52:11 -07:00
. proc_handler = proc_dou8vec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ZERO ,
2018-05-17 14:47:29 -07:00
} ,
tcp: defer regular ACK while processing socket backlog
This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.
For high throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, bottleneck is usually the 'user' cpu)
Problem is the user thread can spend a lot of time while holding
the socket lock, forcing BH handler to queue most of incoming
packets in the socket backlog.
Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases chance of unexpected drops.
Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.
This patch takes advantage of the backlog processing,
and the fact that ACK are mostly cumulative.
The idea is to detect we are in the backlog processing
and defer all eligible ACK into a single one,
sent from tcp_release_cb().
This saves cpu cycles on both sides, and network resources.
Performance of a single TCP flow on a 200Gbit NIC:
- Throughput is increased by 20% (100Gbit -> 120Gbit).
- Number of generated ACK per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.
Benchmark context:
- Regular netperf TCP_STREAM (no zerocopy)
- Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids)
- MAX_SKB_FRAGS = 17 (~60KB per GRO packet)
This feature is guarded by a new sysctl, and enabled by default:
/proc/sys/net/ipv4/tcp_backlog_ack_defer
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-11 17:05:31 +00:00
{
. procname = " tcp_backlog_ack_defer " ,
. data = & init_net . ipv4 . sysctl_tcp_backlog_ack_defer ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
2020-09-09 17:50:48 -07:00
{
. procname = " tcp_reflect_tos " ,
. data = & init_net . ipv4 . sysctl_tcp_reflect_tos ,
2021-03-25 11:08:17 -07:00
. maxlen = sizeof ( u8 ) ,
2020-09-09 17:50:48 -07:00
. mode = 0644 ,
2021-03-25 11:08:17 -07:00
. proc_handler = proc_dou8vec_minmax ,
2020-09-09 17:50:48 -07:00
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.
thash_entries sockets length Gbps
524288 1 1 50.7
24Mi 48 45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries sockets length Gbps
131072 1 1 50.7
1Mi 8 49.9
2Mi 16 48.9
4Mi 32 47.3
8Mi 64 44.6
16Mi 128 40.6
24Mi 192 36.3
32Mi 256 32.5
40Mi 320 27.0
48Mi 384 25.0
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-07 18:10:22 -07:00
{
. procname = " tcp_ehash_entries " ,
. data = & init_net . ipv4 . sysctl_tcp_child_ehash_entries ,
. mode = 0444 ,
. proc_handler = proc_tcp_ehash_entries ,
} ,
{
. procname = " tcp_child_ehash_entries " ,
. data = & init_net . ipv4 . sysctl_tcp_child_ehash_entries ,
. maxlen = sizeof ( unsigned int ) ,
. mode = 0644 ,
. proc_handler = proc_douintvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = & tcp_child_ehash_entries_max ,
} ,
udp: Introduce optional per-netns hash table.
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries sockets length Gbps
64K 1 1 5.69
1Mi 16 5.27
2Mi 32 4.90
4Mi 64 4.09
8Mi 128 2.96
16Mi 256 2.06
32Mi 512 1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
The sysctl usage is the same with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 13:57:57 -08:00
{
. procname = " udp_hash_entries " ,
. data = & init_net . ipv4 . sysctl_udp_child_hash_entries ,
. mode = 0444 ,
. proc_handler = proc_udp_hash_entries ,
} ,
{
. procname = " udp_child_hash_entries " ,
. data = & init_net . ipv4 . sysctl_udp_child_hash_entries ,
. maxlen = sizeof ( unsigned int ) ,
. mode = 0644 ,
. proc_handler = proc_douintvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = & udp_child_hash_entries_max ,
} ,
2018-03-13 21:57:16 -07:00
{
. procname = " udp_rmem_min " ,
. data = & init_net . ipv4 . sysctl_udp_rmem_min ,
. maxlen = sizeof ( init_net . ipv4 . sysctl_udp_rmem_min ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ONE
2018-03-13 21:57:16 -07:00
} ,
{
. procname = " udp_wmem_min " ,
. data = & init_net . ipv4 . sysctl_udp_wmem_min ,
. maxlen = sizeof ( init_net . ipv4 . sysctl_udp_wmem_min ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 15:58:50 -07:00
. extra1 = SYSCTL_ONE
2018-03-13 21:57:16 -07:00
} ,
2021-02-01 21:47:52 +02:00
{
. procname = " fib_notify_on_flag_change " ,
. data = & init_net . ipv4 . sysctl_fib_notify_on_flag_change ,
2021-03-31 10:52:07 -07:00
. maxlen = sizeof ( u8 ) ,
2021-02-01 21:47:52 +02:00
. mode = 0644 ,
2021-03-31 10:52:07 -07:00
. proc_handler = proc_dou8vec_minmax ,
2021-02-01 21:47:52 +02:00
. extra1 = SYSCTL_ZERO ,
2022-05-01 11:55:22 +08:00
. extra2 = SYSCTL_TWO ,
2021-02-01 21:47:52 +02:00
} ,
2022-10-26 13:51:11 +00:00
{
. procname = " tcp_plb_enabled " ,
. data = & init_net . ipv4 . sysctl_tcp_plb_enabled ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
{
. procname = " tcp_plb_idle_rehash_rounds " ,
. data = & init_net . ipv4 . sysctl_tcp_plb_idle_rehash_rounds ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra2 = & tcp_plb_max_rounds ,
} ,
{
. procname = " tcp_plb_rehash_rounds " ,
. data = & init_net . ipv4 . sysctl_tcp_plb_rehash_rounds ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra2 = & tcp_plb_max_rounds ,
} ,
{
. procname = " tcp_plb_suspend_rto_sec " ,
. data = & init_net . ipv4 . sysctl_tcp_plb_suspend_rto_sec ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
} ,
{
. procname = " tcp_plb_cong_thresh " ,
. data = & init_net . ipv4 . sysctl_tcp_plb_cong_thresh ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = & tcp_plb_max_cong_thresh ,
} ,
tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.
We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts
This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-09 18:05:58 +00:00
{
2023-06-06 18:12:33 +00:00
. procname = " tcp_syn_linear_timeouts " ,
. data = & init_net . ipv4 . sysctl_tcp_syn_linear_timeouts ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = & tcp_syn_linear_timeouts_max ,
tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.
We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts
This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-09 18:05:58 +00:00
} ,
tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.
To reproduce: Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()). This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].
As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.
The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender. This problem has previously been identified in
RFC 7323, appendix F [1].
The Linux kernel currently adheres to never shrinking the window.
In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session. A receive buffer full condition should instead result
in a zero window and an indefinite wait.
In practice, this problem is largely hidden for most flows. It
is not applicable to mice flows. Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.
But this problem does show up for other types of flows. Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time. In these cases, we directly
encounter the problem described in [1].
RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].
This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf). This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window
Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323
Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-11 22:05:24 -05:00
{
. procname = " tcp_shrink_window " ,
. data = & init_net . ipv4 . sysctl_tcp_shrink_window ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra1 = SYSCTL_ZERO ,
. extra2 = SYSCTL_ONE ,
} ,
2023-10-11 13:30:44 -07:00
{
. procname = " tcp_pingpong_thresh " ,
. data = & init_net . ipv4 . sysctl_tcp_pingpong_thresh ,
. maxlen = sizeof ( u8 ) ,
. mode = 0644 ,
. proc_handler = proc_dou8vec_minmax ,
. extra1 = SYSCTL_ONE ,
} ,
2008-03-26 01:56:24 -07:00
} ;
2008-03-26 01:54:18 -07:00
static __net_init int ipv4_sysctl_init_net ( struct net * net )
{
2024-05-01 11:29:26 +02:00
size_t table_size = ARRAY_SIZE ( ipv4_net_table ) ;
2008-03-26 01:56:24 -07:00
struct ctl_table * table ;
table = ipv4_net_table ;
2009-11-25 15:14:13 -08:00
if ( ! net_eq ( net , & init_net ) ) {
2013-10-19 16:27:03 -07:00
int i ;
2008-03-26 01:56:24 -07:00
table = kmemdup ( table , sizeof ( ipv4_net_table ) , GFP_KERNEL ) ;
2015-04-03 09:17:26 +01:00
if ( ! table )
2008-03-26 01:56:24 -07:00
goto err_alloc ;
2024-05-01 11:29:26 +02:00
for ( i = 0 ; i < table_size ; i + + ) {
net: Make tcp_allowed_congestion_control readonly in non-init netns
Currently, tcp_allowed_congestion_control is global and writable;
writing to it in any net namespace will leak into all other net
namespaces.
tcp_available_congestion_control and tcp_allowed_congestion_control are
the only sysctls in ipv4_net_table (the per-netns sysctl table) with a
NULL data pointer; their handlers (proc_tcp_available_congestion_control
and proc_allowed_congestion_control) have no other way of referencing a
struct net. Thus, they operate globally.
Because ipv4_net_table does not use designated initializers, there is no
easy way to fix up this one "bad" table entry. However, the data pointer
updating logic shouldn't be applied to NULL pointers anyway, so we
instead force these entries to be read-only.
These sysctls used to exist in ipv4_table (init-net only), but they were
moved to the per-net ipv4_net_table, presumably without realizing that
tcp_allowed_congestion_control was writable and thus introduced a leak.
Because the intent of that commit was only to know (i.e. read) "which
congestion algorithms are available or allowed", this read-only solution
should be sufficient.
The logic added in recent commit
31c4d2f160eb: ("net: Ensure net namespace isolation of sysctls")
does not and cannot check for NULL data pointers, because
other table entries (e.g. /proc/sys/net/netfilter/nf_log/) have
.data=NULL but use other methods (.extra2) to access the struct net.
Fixes: 9cb8e048e5d9 ("net/ipv4/sysctl: show tcp_{allowed, available}_congestion_control in non-initial netns")
Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 03:08:48 -04:00
if ( table [ i ] . data ) {
/* Update the variables to point into
* the current struct net
*/
table [ i ] . data + = ( void * ) net - ( void * ) & init_net ;
} else {
/* Entries without data pointer are global;
* Make them read - only in non - init_net ns
*/
table [ i ] . mode & = ~ 0222 ;
}
}
2008-03-26 01:56:24 -07:00
}
2023-08-09 12:50:03 +02:00
net - > ipv4 . ipv4_hdr = register_net_sysctl_sz ( net , " net/ipv4 " , table ,
2024-05-01 11:29:26 +02:00
table_size ) ;
2015-04-03 09:17:26 +01:00
if ( ! net - > ipv4 . ipv4_hdr )
2008-03-26 01:56:24 -07:00
goto err_reg ;
2014-05-12 16:04:53 -07:00
net - > ipv4 . sysctl_local_reserved_ports = kzalloc ( 65536 / 8 , GFP_KERNEL ) ;
if ( ! net - > ipv4 . sysctl_local_reserved_ports )
goto err_ports ;
2008-03-26 01:54:18 -07:00
return 0 ;
2008-03-26 01:56:24 -07:00
2014-05-12 16:04:53 -07:00
err_ports :
unregister_net_sysctl_table ( net - > ipv4 . ipv4_hdr ) ;
2008-03-26 01:56:24 -07:00
err_reg :
2009-11-25 15:14:13 -08:00
if ( ! net_eq ( net , & init_net ) )
2008-03-26 01:56:24 -07:00
kfree ( table ) ;
err_alloc :
return - ENOMEM ;
2008-03-26 01:54:18 -07:00
}
static __net_exit void ipv4_sysctl_exit_net ( struct net * net )
{
2024-04-18 11:40:08 +02:00
const struct ctl_table * table ;
2008-03-26 01:56:24 -07:00
2014-05-12 16:04:53 -07:00
kfree ( net - > ipv4 . sysctl_local_reserved_ports ) ;
2008-03-26 01:56:24 -07:00
table = net - > ipv4 . ipv4_hdr - > ctl_table_arg ;
unregister_net_sysctl_table ( net - > ipv4 . ipv4_hdr ) ;
kfree ( table ) ;
2008-03-26 01:54:18 -07:00
}
static __net_initdata struct pernet_operations ipv4_sysctl_ops = {
. init = ipv4_sysctl_init_net ,
. exit = ipv4_sysctl_exit_net ,
} ;
2007-12-05 01:41:26 -08:00
static __init int sysctl_ipv4_init ( void )
{
struct ctl_table_header * hdr ;
2012-04-19 13:44:49 +00:00
hdr = register_net_sysctl ( & init_net , " net/ipv4 " , ipv4_table ) ;
2015-04-03 09:17:26 +01:00
if ( ! hdr )
2008-03-26 01:54:18 -07:00
return - ENOMEM ;
if ( register_pernet_subsys ( & ipv4_sysctl_ops ) ) {
2012-04-19 13:24:33 +00:00
unregister_net_sysctl_table ( hdr ) ;
2008-03-26 01:54:18 -07:00
return - ENOMEM ;
}
return 0 ;
2007-12-05 01:41:26 -08:00
}
__initcall ( sysctl_ipv4_init ) ;