2005-04-16 15:20:36 -07:00
/*
 * sysctl_net_ipv4.c: sysctl interface to net IPV4 subsystem.
 *
 * Begun April 1, 1996, Mike Shaver.
 * Added /proc/sys/net/ipv4 directory entry (empty =) ). [MS]
 */
# include <linux/mm.h>
# include <linux/module.h>
# include <linux/sysctl.h>
2005-08-16 02:18:02 -03:00
# include <linux/igmp.h>
2005-12-27 02:43:12 -02:00
# include <linux/inetdevice.h>
2007-10-10 17:30:46 -07:00
# include <linux/seqlock.h>
2007-12-05 01:41:26 -08:00
# include <linux/init.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have a fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport header values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I) were tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
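The group-range rule described in the commit message above can be sketched in userspace C. This is a minimal sketch, not the kernel's code: the helper name `gid_in_ping_range` is hypothetical, it checks a single group ID rather than a process's full group list, and it assumes the sysctl holds an inclusive [low, high] pair of 32-bit group IDs, so that the default "1 0" describes an empty range.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): returns true when gid
 * lies inside the inclusive range [low, high]. A range with low > high,
 * such as the default "1 0", contains no group at all, so nobody may
 * create a ping socket.
 */
static bool gid_in_ping_range(uint32_t low, uint32_t high, uint32_t gid)
{
	return low <= gid && gid <= high;
}
```

Under these assumptions, `gid_in_ping_range(1, 0, 0)` is false even for root's group, while `gid_in_ping_range(0, 4294967295u, gid)` is true for every `gid`.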
# include <linux/nsproxy.h>
2011-12-11 21:47:05 +00:00
# include <linux/swap.h>
2005-04-16 15:20:36 -07:00
# include <net/snmp.h>
2005-08-16 02:18:02 -03:00
# include <net/icmp.h>
2005-04-16 15:20:36 -07:00
# include <net/ip.h>
# include <net/route.h>
# include <net/tcp.h>
2007-12-31 00:29:24 -08:00
# include <net/udp.h>
2006-08-03 16:48:06 -07:00
# include <net/cipso_ipv4.h>
2007-10-15 02:33:45 -07:00
# include <net/inet_frag.h>
2011-05-13 10:01:00 +00:00
# include <net/ping.h>
2011-12-11 21:47:06 +00:00
# include <net/tcp_memcontrol.h>
2005-04-16 15:20:36 -07:00
2005-12-13 23:14:27 -08:00
static int zero ;
2013-01-23 20:35:28 +00:00
static int one = 1 ;
tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-11 10:00:43 +00:00
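The PTO scheduling rules listed in the TLP commit message above can be sketched as a small userspace function. This is an illustration under stated assumptions, not the kernel's implementation: the name `tlp_probe_timeout_ms` is hypothetical, all times are in milliseconds, the message's "RTT" and "SRTT" are treated as the same smoothed estimate, and the caller has already checked the scheduling conditions (Open state, SACK enabled, at least one packet outstanding).

```c
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Hypothetical helper: probe timeout (PTO) in milliseconds.
 *   packets_out == 1: max(2*SRTT, 1.5*SRTT + 200ms), allowing for the
 *                     receiver's delayed-ACK timer
 *   packets_out  > 1: max(2*SRTT, 10ms)
 * The result is capped at the RTO. Assumes packets_out >= 1 and values
 * small enough that the multiplications do not overflow.
 */
static uint32_t tlp_probe_timeout_ms(uint32_t srtt_ms, uint32_t packets_out,
				     uint32_t rto_ms)
{
	uint32_t pto;

	if (packets_out == 1)
		pto = max_u32(2 * srtt_ms, (3 * srtt_ms) / 2 + 200);
	else
		pto = max_u32(2 * srtt_ms, 10);

	return min_u32(pto, rto_ms);
}
```

For example, with a 20 ms SRTT and a 1000 ms RTO, three outstanding packets give a 40 ms PTO, while a single outstanding packet gives 230 ms.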
static int four = 4 ;
tcp: TSO packets automatic sizing
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27 05:46:32 -07:00
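The sizing idea in the TSO commit message above (pace at twice the current rate, aim for roughly one packet per millisecond, never go below tcp_min_tso_segs) can be sketched in userspace C. This is only a sketch under assumptions: both function names are hypothetical, SRTT is taken in microseconds, the rate is in bytes per second, and the upper bound `max_segs` is passed in by the caller rather than taken from any kernel constant.

```c
#include <stdint.h>

/*
 * Hypothetical helper: sk_pacing_rate = 2 * cwnd * mss / srtt,
 * expressed in bytes per second with srtt given in microseconds.
 * Assumes srtt_us != 0 and realistic cwnd/mss values (no overflow).
 */
static uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
					  uint32_t srtt_us)
{
	return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/*
 * Hypothetical helper: number of MSS-sized segments that fit into one
 * millisecond at the given rate, clamped to [min_segs, max_segs].
 * Assumes mss != 0 and min_segs <= max_segs.
 */
static uint32_t tso_segs_goal(uint64_t rate, uint32_t mss,
			      uint32_t min_segs, uint32_t max_segs)
{
	uint64_t bytes_per_ms = rate / 1000;
	uint64_t segs = bytes_per_ms / mss;

	if (segs < min_segs)
		segs = min_segs;
	if (segs > max_segs)
		segs = max_segs;
	return (uint32_t)segs;
}
```

With cwnd = 10, mss = 1000 bytes and a 10 ms SRTT, the rate is 2,000,000 bytes per second, which is 2 segments per millisecond; a slow flow that would round down to zero segments is lifted to the minimum of 2.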
static int gso_max_segs = GSO_MAX_SEGS ;
2007-02-09 23:24:47 +09:00
static int tcp_retr1_max = 255 ;
2005-04-16 15:20:36 -07:00
static int ip_local_port_range_min [ ] = { 1 , 1 } ;
static int ip_local_port_range_max [ ] = { 65535 , 65535 } ;
2010-11-22 12:54:21 +00:00
static int tcp_adv_win_scale_min = - 31 ;
static int tcp_adv_win_scale_max = 31 ;
2010-12-13 12:16:14 -08:00
static int ip_ttl_min = 1 ;
static int ip_ttl_max = 255 ;
2013-07-19 14:09:01 +02:00
static int tcp_syn_retries_min = 1 ;
static int tcp_syn_retries_max = MAX_TCP_SYNCNT ;
2011-05-13 10:01:00 +00:00
static int ip_ping_group_range_min [ ] = { 0 , 0 } ;
static int ip_ping_group_range_max [ ] = { GID_T_MAX , GID_T_MAX } ;
2015-05-27 22:16:49 +03:00
static int min_sndbuf = SOCK_MIN_SNDBUF ;
static int min_rcvbuf = SOCK_MIN_RCVBUF ;
2005-04-16 15:20:36 -07:00
2007-10-10 17:30:46 -07:00
/* Update system visible IP port range */
2013-09-28 14:10:59 -07:00
static void set_local_port_range ( struct net * net , int range [ 2 ] )
2007-10-10 17:30:46 -07:00
{
2015-05-27 11:34:37 -07:00
bool same_parity = !((range[0] ^ range[1]) & 1);
2014-05-06 11:02:49 -07:00
write_seqlock(&net->ipv4.ip_local_ports.lock);
2015-05-27 11:34:37 -07:00
if (same_parity && !net->ipv4.ip_local_ports.warned) {
net->ipv4.ip_local_ports.warned = true;
pr_err_ratelimited("ip_local_port_range: prefer different parity for start/end values.\n");
}
2014-05-06 11:02:49 -07:00
net->ipv4.ip_local_ports.range[0] = range[0];
net->ipv4.ip_local_ports.range[1] = range[1];
write_sequnlock(&net->ipv4.ip_local_ports.lock);
2007-10-10 17:30:46 -07:00
}
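The parity test in set_local_port_range() XORs the two bounds and looks at the lowest bit. A standalone sketch of just that check follows; the function name `port_range_same_parity` is hypothetical and nothing else from the kernel function (locking, the one-time warning) is modelled.

```c
#include <stdbool.h>

/*
 * Hypothetical standalone version of the parity test: the low bit of
 * (low ^ high) is 0 exactly when both bounds are even or both are odd.
 */
static bool port_range_same_parity(int low, int high)
{
	return !((low ^ high) & 1);
}
```

For instance, the bounds 32768 and 60999 have different parity and would not trigger the warning, whereas 32768 and 61000 would.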
/* Validate changes from /proc interface. */
2013-06-11 23:04:25 -07:00
static int ipv4_local_port_range ( struct ctl_table * table , int write ,
2007-10-10 17:30:46 -07:00
void __user * buffer ,
size_t * lenp , loff_t * ppos )
{
2013-09-28 14:10:59 -07:00
struct net * net =
2014-05-06 11:02:49 -07:00
container_of(table->data, struct net, ipv4.ip_local_ports.range);
2007-10-10 17:30:46 -07:00
int ret ;
2008-10-08 14:18:04 -07:00
int range [ 2 ] ;
2013-06-11 23:04:25 -07:00
struct ctl_table tmp = {
2007-10-10 17:30:46 -07:00
.data = &range,
.maxlen = sizeof(range),
.mode = table->mode,
.extra1 = &ip_local_port_range_min,
.extra2 = &ip_local_port_range_max,
};
2013-09-28 14:10:59 -07:00
inet_get_local_port_range ( net , & range [ 0 ] , & range [ 1 ] ) ;
2009-09-23 15:57:19 -07:00
ret = proc_dointvec_minmax ( & tmp , write , buffer , lenp , ppos ) ;
2007-10-10 17:30:46 -07:00
if (write && ret == 0) {
2007-10-18 22:00:17 -07:00
if ( range [ 1 ] < range [ 0 ] )
2007-10-10 17:30:46 -07:00
ret = - EINVAL ;
else
2013-09-28 14:10:59 -07:00
set_local_port_range ( net , range ) ;
2007-10-10 17:30:46 -07:00
}
return ret ;
}
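The validation performed by ipv4_local_port_range() on a write can be summarised in a userspace sketch: each bound must lie within the min/max tables (1 to 65535), and the upper bound must not be below the lower one. The function name `validate_local_port_range` is hypothetical, and returning -EINVAL for the out-of-bounds case is an assumption about what the bounds check reports; only the `range[1] < range[0]` rejection is taken directly from the code above.

```c
#include <errno.h>

/*
 * Hypothetical userspace summary of the checks applied to a new
 * ip_local_port_range value. Returns 0 when the pair is acceptable,
 * -EINVAL otherwise.
 */
static int validate_local_port_range(int low, int high)
{
	/* Bounds from ip_local_port_range_min/max: 1..65535 for both. */
	if (low < 1 || low > 65535 || high < 1 || high > 65535)
		return -EINVAL;

	/* The handler rejects a range whose end precedes its start. */
	if (high < low)
		return -EINVAL;

	return 0;
}
```

A single-port range such as 40000 to 40000 passes, since the handler only rejects `range[1] < range[0]`.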
2011-05-13 10:01:00 +00:00
2012-05-24 10:34:21 -06:00
static void inet_get_ping_group_range_table ( struct ctl_table * table , kgid_t * low , kgid_t * high )
2011-05-13 10:01:00 +00:00
{
2012-05-24 10:34:21 -06:00
kgid_t *data = table->data;
2013-09-28 14:10:59 -07:00
struct net * net =
2014-05-06 11:02:50 -07:00
container_of(table->data, struct net, ipv4.ping_group_range.range);
2012-04-15 05:58:06 +00:00
unsigned int seq ;
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
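A minimal userspace sketch of the socket creation described above, assuming a Linux system with the usual libc headers. The helper names open_ping_socket() and ping_socket_available() are hypothetical and not part of the patch. When the caller's groups fall outside ping_group_range, socket(2) is expected to fail (with EACCES, as far as I am aware), so the sketch only reports success or failure rather than relying on a specific errno.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to open an unprivileged ICMP "ping" socket.
 * Returns a descriptor on success, or -1 with errno set by socket(2). */
int open_ping_socket(void)
{
	return socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}

/* Hypothetical helper: returns 1 if the caller can create ping sockets
 * right now, 0 otherwise (feature disabled, group not allowed, ...). */
int ping_socket_available(void)
{
	int fd = open_ping_socket();

	if (fd < 0)
		return 0;
	close(fd);
	return 1;
}
```

A ping implementation could call ping_socket_available() first and fall back to a raw socket when it returns 0.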
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport header values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
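The header rules above can be illustrated with a small sketch that builds an echo request in a byte buffer. The function names inet_checksum() and build_echo_request() are hypothetical. The checksum is the standard Internet checksum (RFC 1071); since the text says the kernel recomputes it and overwrites the id, filling them in here is only for illustration and for code shared with a raw-socket path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ECHO_HDR_LEN 8	/* type, code, checksum, id, sequence */

/* RFC 1071 Internet checksum, computed over big-endian 16-bit words.
 * The returned value is stored high byte first. */
uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)			/* odd trailing byte, zero padded */
		sum += (uint32_t)buf[i] << 8;
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Fill pkt with an ICMP echo request: type 8, code 0, id 0 (the text
 * says the kernel sets the id to the socket's local port).
 * Returns the packet length, or 0 if pkt is too small. */
size_t build_echo_request(uint8_t *pkt, size_t cap, uint16_t seq,
			  const void *payload, size_t plen)
{
	uint16_t csum;

	if (cap < ECHO_HDR_LEN || plen > cap - ECHO_HDR_LEN)
		return 0;
	memset(pkt, 0, ECHO_HDR_LEN);
	pkt[0] = 8;			/* ICMP_ECHO */
	pkt[6] = (uint8_t)(seq >> 8);
	pkt[7] = (uint8_t)(seq & 0xff);
	if (plen)
		memcpy(pkt + ECHO_HDR_LEN, payload, plen);
	csum = inet_checksum(pkt, ECHO_HDR_LEN + plen);
	pkt[2] = (uint8_t)(csum >> 8);
	pkt[3] = (uint8_t)(csum & 0xff);
	return ECHO_HDR_LEN + plen;
}
```

Running inet_checksum() over a finished packet yields 0, which is a convenient self-check.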
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
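The range semantics above can be sketched as an inclusive interval test. The function gid_in_ping_range() is hypothetical and models only the comparison. As I understand it, the kernel applies such a test to the caller's effective and supplementary groups, which this sketch does not attempt to reproduce.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the ping_group_range test: a gid is allowed
 * when low <= gid <= high. With the default "1 0" the interval is
 * empty, so no gid (not even 0) can match. */
bool gid_in_ping_range(uint32_t gid, uint32_t low, uint32_t high)
{
	return low <= gid && gid <= high;
}
```

For example, gid_in_ping_range(0, 1, 0) is false for root's group, while gid_in_ping_range(1000, 100, 4294967295u) is true for a typical user group.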
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I) were tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
	do {
2014-05-06 11:02:49 -07:00
		seq = read_seqbegin(&net->ipv4.ip_local_ports.lock);
2011-05-13 10:01:00 +00:00
		*low = data[0];
		*high = data[1];
2014-05-06 11:02:49 -07:00
	} while (read_seqretry(&net->ipv4.ip_local_ports.lock, seq));
2011-05-13 10:01:00 +00:00
}
/* Update system visible ping group range */
2012-05-24 10:34:21 -06:00
static void set_ping_group_range(struct ctl_table *table, kgid_t low, kgid_t high)
2011-05-13 10:01:00 +00:00
{
2012-05-24 10:34:21 -06:00
	kgid_t *data = table->data;
2013-09-28 14:10:59 -07:00
	struct net *net =
2014-05-06 11:02:50 -07:00
		container_of(table->data, struct net, ipv4.ping_group_range.range);
2014-05-06 11:02:49 -07:00
	write_seqlock(&net->ipv4.ip_local_ports.lock);
2012-05-24 10:34:21 -06:00
	data[0] = low;
	data[1] = high;
2014-05-06 11:02:49 -07:00
	write_sequnlock(&net->ipv4.ip_local_ports.lock);
2011-05-13 10:01:00 +00:00
}
/* Validate changes from /proc interface. */
2013-06-11 23:04:25 -07:00
static int ipv4_ping_group_range(struct ctl_table *table, int write,
2011-05-13 10:01:00 +00:00
				 void __user *buffer,
				 size_t *lenp, loff_t *ppos)
{
2012-05-24 10:34:21 -06:00
	struct user_namespace *user_ns = current_user_ns();
2011-05-13 10:01:00 +00:00
	int ret;
2012-05-24 10:34:21 -06:00
	gid_t urange[2];
	kgid_t low, high;
2013-06-11 23:04:25 -07:00
	struct ctl_table tmp = {
2012-05-24 10:34:21 -06:00
		.data = &urange,
		.maxlen = sizeof(urange),
2011-05-13 10:01:00 +00:00
		.mode = table->mode,
		.extra1 = &ip_ping_group_range_min,
		.extra2 = &ip_ping_group_range_max,
	};
2012-05-24 10:34:21 -06:00
	inet_get_ping_group_range_table(table, &low, &high);
	urange[0] = from_kgid_munged(user_ns, low);
	urange[1] = from_kgid_munged(user_ns, high);
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
Message identifiers (octets 4-5 of the ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes; port 0 is reserved for the API ("let the
kernel pick a free number"). There is no notion of remote ports; remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate, in order to:
1) Avoid the need to transport header values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, and the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their ids, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
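The send-side rules above can be illustrated with a minimal userspace sketch. It assumes a Linux system where ping sockets are enabled for the caller; the helper names build_echo_request() and open_ping_socket() are hypothetical and not part of the patch. The id and checksum fields are left as zero because, per the description above, the kernel overwrites the id with the socket's local port and recomputes the checksum.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define ECHO_HDR_LEN 8 /* type, code, checksum, id, sequence */

/*
 * Hypothetical helper: fill buf with an ICMP echo request (8-byte header
 * followed by the payload). Returns the total length, or 0 if buf is too
 * small.
 */
size_t build_echo_request(uint8_t *buf, size_t buflen, uint16_t seq,
                          const void *payload, size_t paylen)
{
    if (buflen < ECHO_HDR_LEN + paylen)
        return 0;
    buf[0] = 8;                     /* type: ICMP_ECHO */
    buf[1] = 0;                     /* code: must be zero */
    buf[2] = buf[3] = 0;            /* checksum: recomputed by the kernel */
    buf[4] = buf[5] = 0;            /* id: replaced by the socket's local port */
    buf[6] = (uint8_t)(seq >> 8);   /* sequence number, network byte order */
    buf[7] = (uint8_t)(seq & 0xff);
    if (paylen)
        memcpy(buf + ECHO_HDR_LEN, payload, paylen);
    return ECHO_HDR_LEN + paylen;
}

/*
 * Hypothetical helper: open a ping socket. Returns -1 with errno set if the
 * caller's group is outside ping_group_range or the kernel lacks support.
 */
int open_ping_socket(void)
{
    return socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}
```

A caller would pass the buffer to sendto() with a struct sockaddr_in destination and read the unmodified reply, ICMP header included, with recv().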
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
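The range semantics described above can be sketched as a simple inclusive interval test. This is an illustration only: gid_in_ping_range() is a hypothetical name, and which of the caller's group IDs the kernel actually tests (effective, supplementary) is not stated in this message, so the sketch checks a single gid.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if gid lies in the inclusive range [low, high].
 * With the default "1 0" the range is empty, so no gid matches.
 */
bool gid_in_ping_range(uint32_t gid, uint32_t low, uint32_t high)
{
    return low <= gid && gid <= high;
}
```

Under this reading, "1 0" denies everyone including gid 0, "100 100" admits exactly one group, and "100 4294967295" admits gid 100 and above.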
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I) are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
ret = proc_dointvec_minmax ( & tmp , write , buffer , lenp , ppos ) ;
2012-05-24 10:34:21 -06:00
if ( write & & ret = = 0 ) {
low = make_kgid ( user_ns , urange [ 0 ] ) ;
high = make_kgid ( user_ns , urange [ 1 ] ) ;
if ( ! gid_valid ( low ) | | ! gid_valid ( high ) | |
( urange [ 1 ] < urange [ 0 ] ) | | gid_lt ( high , low ) ) {
low = make_kgid ( & init_user_ns , 1 ) ;
high = make_kgid ( & init_user_ns , 0 ) ;
}
set_ping_group_range ( table , low , high ) ;
}
2011-05-13 10:01:00 +00:00
return ret ;
}
2013-06-11 23:04:25 -07:00
static int proc_tcp_congestion_control ( struct ctl_table * ctl , int write ,
2005-06-23 12:19:55 -07:00
void __user * buffer , size_t * lenp , loff_t * ppos )
{
char val [ TCP_CA_NAME_MAX ] ;
2013-06-11 23:04:25 -07:00
struct ctl_table tbl = {
2005-06-23 12:19:55 -07:00
. data = val ,
. maxlen = TCP_CA_NAME_MAX ,
} ;
int ret ;
tcp_get_default_congestion_control ( val ) ;
2009-09-23 15:57:19 -07:00
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
2005-06-23 12:19:55 -07:00
if ( write & & ret = = 0 )
ret = tcp_set_default_congestion_control ( val ) ;
return ret ;
}
2013-06-11 23:04:25 -07:00
static int proc_tcp_available_congestion_control ( struct ctl_table * ctl ,
2009-09-23 15:57:19 -07:00
int write ,
2006-11-09 16:32:06 -08:00
void __user * buffer , size_t * lenp ,
loff_t * ppos )
{
2013-06-11 23:04:25 -07:00
struct ctl_table tbl = { . maxlen = TCP_CA_BUF_MAX , } ;
2006-11-09 16:32:06 -08:00
int ret ;
tbl . data = kmalloc ( tbl . maxlen , GFP_USER ) ;
if ( ! tbl . data )
return - ENOMEM ;
tcp_get_available_congestion_control ( tbl . data , TCP_CA_BUF_MAX ) ;
2009-09-23 15:57:19 -07:00
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
2006-11-09 16:32:06 -08:00
kfree ( tbl . data ) ;
return ret ;
}
2013-06-11 23:04:25 -07:00
static int proc_allowed_congestion_control ( struct ctl_table * ctl ,
2009-09-23 15:57:19 -07:00
int write ,
2006-11-09 16:35:15 -08:00
void __user * buffer , size_t * lenp ,
loff_t * ppos )
{
2013-06-11 23:04:25 -07:00
struct ctl_table tbl = { . maxlen = TCP_CA_BUF_MAX } ;
2006-11-09 16:35:15 -08:00
int ret ;
tbl . data = kmalloc ( tbl . maxlen , GFP_USER ) ;
if ( ! tbl . data )
return - ENOMEM ;
tcp_get_allowed_congestion_control ( tbl . data , tbl . maxlen ) ;
2009-09-23 15:57:19 -07:00
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
2006-11-09 16:35:15 -08:00
if ( write & & ret = = 0 )
ret = tcp_set_allowed_congestion_control ( tbl . data ) ;
kfree ( tbl . data ) ;
return ret ;
}
2013-06-11 23:04:25 -07:00
static int proc_tcp_fastopen_key ( struct ctl_table * ctl , int write ,
void __user * buffer , size_t * lenp ,
loff_t * ppos )
2012-08-31 12:29:11 +00:00
{
2013-06-11 23:04:25 -07:00
struct ctl_table tbl = { . maxlen = ( TCP_FASTOPEN_KEY_LENGTH * 2 + 10 ) } ;
2012-08-31 12:29:11 +00:00
struct tcp_fastopen_context * ctxt ;
int ret ;
u32 user_key [ 4 ] ; /* 16 bytes, matching TCP_FASTOPEN_KEY_LENGTH */
tbl . data = kmalloc ( tbl . maxlen , GFP_KERNEL ) ;
if ( ! tbl . data )
return - ENOMEM ;
rcu_read_lock ( ) ;
ctxt = rcu_dereference ( tcp_fastopen_ctx ) ;
if ( ctxt )
memcpy ( user_key , ctxt - > key , TCP_FASTOPEN_KEY_LENGTH ) ;
2012-10-11 06:24:14 +00:00
else
memset ( user_key , 0 , sizeof ( user_key ) ) ;
2012-08-31 12:29:11 +00:00
rcu_read_unlock ( ) ;
snprintf ( tbl . data , tbl . maxlen , " %08x-%08x-%08x-%08x " ,
user_key [ 0 ] , user_key [ 1 ] , user_key [ 2 ] , user_key [ 3 ] ) ;
ret = proc_dostring ( & tbl , write , buffer , lenp , ppos ) ;
if ( write & & ret = = 0 ) {
if ( sscanf ( tbl . data , " %x-%x-%x-%x " , user_key , user_key + 1 ,
user_key + 2 , user_key + 3 ) ! = 4 ) {
ret = - EINVAL ;
goto bad_key ;
}
2013-10-19 21:48:58 +02:00
/* Generate a dummy secret but don't publish it. This
* is needed so we don't regenerate a new key on the
* first invocation of tcp_fastopen_cookie_gen
*/
tcp_fastopen_init_key_once ( false ) ;
2012-08-31 12:29:11 +00:00
tcp_fastopen_reset_cipher ( user_key , TCP_FASTOPEN_KEY_LENGTH ) ;
}
bad_key :
pr_debug ( " proc FO key set 0x%x-%x-%x-%x <- 0x%s: %u \n " ,
user_key [ 0 ] , user_key [ 1 ] , user_key [ 2 ] , user_key [ 3 ] ,
( char * ) tbl . data , ret ) ;
kfree ( tbl . data ) ;
return ret ;
}
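The textual key format handled by proc_tcp_fastopen_key() above (four 32-bit words in hexadecimal, separated by dashes) can be exercised from userspace with a minimal sketch like the following. The helper names are hypothetical; only the "%08x-%08x-%08x-%08x" and "%x-%x-%x-%x" formats are taken from the handler, without the padding spaces shown in this listing, which are assumed to be an artifact of how the source is displayed.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse "xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx" into four
 * 32-bit words. Returns 0 on success, -1 if fewer than four words are found
 * (the handler above returns -EINVAL in that case).
 */
int parse_fastopen_key(const char *text, uint32_t key[4])
{
    unsigned int w[4];

    if (sscanf(text, "%x-%x-%x-%x", &w[0], &w[1], &w[2], &w[3]) != 4)
        return -1;
    for (int i = 0; i < 4; i++)
        key[i] = w[i];
    return 0;
}

/*
 * Hypothetical helper: format four words the way the handler prints them.
 * Returns the snprintf() result (35 characters for a full key).
 */
int format_fastopen_key(char *out, size_t outlen, const uint32_t key[4])
{
    return snprintf(out, outlen, "%08x-%08x-%08x-%08x",
                    (unsigned int)key[0], (unsigned int)key[1],
                    (unsigned int)key[2], (unsigned int)key[3]);
}
```

A 35-character key plus terminator fits in the 42-byte buffer the handler allocates (TCP_FASTOPEN_KEY_LENGTH * 2 + 10, with a 16-byte key).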
2007-12-05 01:41:26 -08:00
static struct ctl_table ipv4_table [ ] = {
2007-02-09 23:24:47 +09:00
{
2005-04-16 15:20:36 -07:00
. procname = " tcp_timestamps " ,
. data = & sysctl_tcp_timestamps ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2007-02-09 23:24:47 +09:00
{
2005-04-16 15:20:36 -07:00
. procname = " tcp_window_scaling " ,
. data = & sysctl_tcp_window_scaling ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2007-02-09 23:24:47 +09:00
{
2005-04-16 15:20:36 -07:00
. procname = " tcp_sack " ,
. data = & sysctl_tcp_sack ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2007-02-09 23:24:47 +09:00
{
2005-04-16 15:20:36 -07:00
. procname = " tcp_retrans_collapse " ,
. data = & sysctl_tcp_retrans_collapse ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2007-02-09 23:24:47 +09:00
{
2005-04-16 15:20:36 -07:00
. procname = " ip_default_ttl " ,
2007-02-09 23:24:47 +09:00
. data = & sysctl_ip_default_ttl ,
2005-04-16 15:20:36 -07:00
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2010-12-13 12:16:14 -08:00
. proc_handler = proc_dointvec_minmax ,
. extra1 = & ip_ttl_min ,
. extra2 = & ip_ttl_max ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_syn_retries " ,
. data = & sysctl_tcp_syn_retries ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2013-07-19 14:09:01 +02:00
. proc_handler = proc_dointvec_minmax ,
. extra1 = & tcp_syn_retries_min ,
. extra2 = & tcp_syn_retries_max
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_synack_retries " ,
. data = & sysctl_tcp_synack_retries ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_max_orphans " ,
. data = & sysctl_tcp_max_orphans ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_max_tw_buckets " ,
2005-08-09 20:44:40 -07:00
. data = & tcp_death_row . sysctl_max_tw_buckets ,
2005-04-16 15:20:36 -07:00
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2012-06-21 13:58:31 +00:00
{
. procname = " ip_early_demux " ,
. data = & sysctl_ip_early_demux ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2005-04-16 15:20:36 -07:00
{
. procname = " ip_dynaddr " ,
. data = & sysctl_ip_dynaddr ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_keepalive_time " ,
. data = & sysctl_tcp_keepalive_time ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_jiffies ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_keepalive_probes " ,
. data = & sysctl_tcp_keepalive_probes ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_keepalive_intvl " ,
. data = & sysctl_tcp_keepalive_intvl ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_jiffies ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_retries1 " ,
. data = & sysctl_tcp_retries1 ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_minmax ,
2005-04-16 15:20:36 -07:00
. extra2 = & tcp_retr1_max
} ,
{
. procname = " tcp_retries2 " ,
. data = & sysctl_tcp_retries2 ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_fin_timeout " ,
. data = & sysctl_tcp_fin_timeout ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_jiffies ,
2005-04-16 15:20:36 -07:00
} ,
# ifdef CONFIG_SYN_COOKIES
{
. procname = " tcp_syncookies " ,
. data = & sysctl_tcp_syncookies ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
# endif
2012-07-19 06:43:05 +00:00
{
. procname = " tcp_fastopen " ,
. data = & sysctl_tcp_fastopen ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2012-08-31 12:29:11 +00:00
{
. procname = " tcp_fastopen_key " ,
. mode = 0600 ,
. maxlen = ( ( TCP_FASTOPEN_KEY_LENGTH * 2 ) + 10 ) ,
. proc_handler = proc_tcp_fastopen_key ,
} ,
2005-04-16 15:20:36 -07:00
{
. procname = " tcp_tw_recycle " ,
2005-08-09 20:44:40 -07:00
. data = & tcp_death_row . sysctl_tw_recycle ,
2005-04-16 15:20:36 -07:00
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_abort_on_overflow " ,
. data = & sysctl_tcp_abort_on_overflow ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_stdurg " ,
. data = & sysctl_tcp_stdurg ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_rfc1337 " ,
. data = & sysctl_tcp_rfc1337 ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_max_syn_backlog " ,
. data = & sysctl_max_syn_backlog ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " igmp_max_memberships " ,
. data = & sysctl_igmp_max_memberships ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " igmp_max_msf " ,
. data = & sysctl_igmp_max_msf ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2014-09-02 15:49:26 +02:00
# ifdef CONFIG_IP_MULTICAST
{
. procname = " igmp_qrv " ,
. data = & sysctl_igmp_qrv ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & one
} ,
# endif
2005-04-16 15:20:36 -07:00
{
. procname = " inet_peer_threshold " ,
. data = & inet_peer_threshold ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " inet_peer_minttl " ,
. data = & inet_peer_minttl ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_jiffies ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " inet_peer_maxttl " ,
. data = & inet_peer_maxttl ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_jiffies ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_orphan_retries " ,
. data = & sysctl_tcp_orphan_retries ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_fack " ,
. data = & sysctl_tcp_fack ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_reordering " ,
. data = & sysctl_tcp_reordering ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2014-10-27 21:45:24 -07:00
{
. procname = " tcp_max_reordering " ,
. data = & sysctl_tcp_max_reordering ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2005-04-16 15:20:36 -07:00
{
. procname = " tcp_dsack " ,
. data = & sysctl_tcp_dsack ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
2013-10-19 16:25:36 -07:00
{
. procname = " tcp_mem " ,
. maxlen = sizeof ( sysctl_tcp_mem ) ,
. data = & sysctl_tcp_mem ,
. mode = 0644 ,
. proc_handler = proc_doulongvec_minmax ,
} ,
2005-04-16 15:20:36 -07:00
{
. procname = " tcp_wmem " ,
. data = & sysctl_tcp_wmem ,
. maxlen = sizeof ( sysctl_tcp_wmem ) ,
. mode = 0644 ,
2013-01-23 20:35:28 +00:00
. proc_handler = proc_dointvec_minmax ,
2015-05-27 22:16:49 +03:00
. extra1 = & min_sndbuf ,
2005-04-16 15:20:36 -07:00
} ,
tcp: TCP_NOTSENT_LOWAT socket option
The idea of this patch is to add an optional limit on the number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.
TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :
Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)
For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or to be woken up in a blocking sendmsg())
This patch adds two ways to set the limit :
1) Per socket option TCP_NOTSENT_LOWAT
2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
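The per-socket option can be set with an ordinary setsockopt() call. The following is a minimal sketch, not part of the patch: set_notsent_lowat() is a hypothetical helper, and the fallback value 25 for TCP_NOTSENT_LOWAT is the Linux UAPI value, assumed here only for libc headers that predate the option.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25 /* assumed Linux value for older libc headers */
#endif

/*
 * Hypothetical helper: limit the number of unsent bytes queued on a TCP
 * socket. Returns 0 on success or -1 with errno set, as setsockopt() does.
 * Per the description above, a value of 0 falls back to the sysctl.
 */
int set_notsent_lowat(int fd, unsigned int bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof(bytes));
}
```

For example, set_notsent_lowat(fd, 131072) applies the 128 KB figure discussed above to a single connection without changing the system-wide sysctl.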
This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->notsent_lowat
Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol)
Tested:
netperf sessions, and watching /proc/net/protocols "memory" column for TCP
With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.
A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
412,514 context-switches
200.034645535 seconds time elapsed
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
2,675,818 context-switches
200.029651391 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 20:27:07 -07:00
{
. procname = " tcp_notsent_lowat " ,
. data = & sysctl_tcp_notsent_lowat ,
. maxlen = sizeof ( sysctl_tcp_notsent_lowat ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2005-04-16 15:20:36 -07:00
{
. procname = " tcp_rmem " ,
. data = & sysctl_tcp_rmem ,
. maxlen = sizeof ( sysctl_tcp_rmem ) ,
. mode = 0644 ,
2013-01-23 20:35:28 +00:00
. proc_handler = proc_dointvec_minmax ,
2015-05-27 22:16:49 +03:00
. extra1 = & min_rcvbuf ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_app_win " ,
. data = & sysctl_tcp_app_win ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_adv_win_scale " ,
. data = & sysctl_tcp_adv_win_scale ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2010-11-22 12:54:21 +00:00
. proc_handler = proc_dointvec_minmax ,
. extra1 = & tcp_adv_win_scale_min ,
. extra2 = & tcp_adv_win_scale_max ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_tw_reuse " ,
. data = & sysctl_tcp_tw_reuse ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_frto " ,
. data = & sysctl_tcp_frto ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_low_latency " ,
. data = & sysctl_tcp_low_latency ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_no_metrics_save " ,
. data = & sysctl_tcp_nometrics_save ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_moderate_rcvbuf " ,
. data = & sysctl_tcp_moderate_rcvbuf ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2005-04-16 15:20:36 -07:00
} ,
{
. procname = " tcp_tso_win_divisor " ,
. data = & sysctl_tcp_tso_win_divisor ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2005-04-16 15:20:36 -07:00
} ,
{
2005-06-23 12:19:55 -07:00
. procname = " tcp_congestion_control " ,
2005-04-16 15:20:36 -07:00
. mode = 0644 ,
2005-06-23 12:19:55 -07:00
. maxlen = TCP_CA_NAME_MAX ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_tcp_congestion_control ,
2005-04-16 15:20:36 -07:00
} ,
2007-02-09 23:24:47 +09:00
{
2006-03-20 22:40:29 -08:00
. procname = " tcp_workaround_signed_windows " ,
. data = & sysctl_tcp_workaround_signed_windows ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2006-03-20 22:40:29 -08:00
} ,
tcp: TCP Small Queues
This introduces TSQ (TCP Small Queues).
The goal of TSQ is to reduce the number of TCP packets in xmit queues
(qdisc & device queues), to reduce RTT and cwnd bias, part of the
bufferbloat problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
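The sizing rule in the two paragraphs above can be written as a short arithmetic sketch. It only restates the commit text (TSO packets capped to half the limit, never above the GSO maximum); tsq_tso_cap() is a hypothetical name and does not correspond to a kernel function.

```c
/*
 * Hypothetical helper: largest TSO packet size implied by a TSQ limit.
 * The cap is half the limit so two TSO packets can be in flight, and it
 * never exceeds the device's GSO maximum.
 */
unsigned int tsq_tso_cap(unsigned int limit_bytes, unsigned int gso_max_bytes)
{
    unsigned int half = limit_bytes / 2;

    return half < gso_max_bytes ? half : gso_max_bytes;
}
```

With the numbers from the text, a limit of 40000 and a GSO maximum of 65536 give a cap of 20000 bytes, while a limit of 131072 (128 KB) leaves the 65536-byte maximum unchanged.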
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and socket autotuning on both sides no longer uses 4 MBytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 05:50:31 +00:00
{
. procname = " tcp_limit_output_bytes " ,
. data = & sysctl_tcp_limit_output_bytes ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2012-07-17 10:13:05 +02:00
{
. procname = " tcp_challenge_ack_limit " ,
. data = & sysctl_tcp_challenge_ack_limit ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2006-06-13 22:33:04 -07:00
{
. procname = " tcp_slow_start_after_idle " ,
. data = & sysctl_tcp_slow_start_after_idle ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2006-06-13 22:33:04 -07:00
} ,
2006-08-03 16:48:06 -07:00
# ifdef CONFIG_NETLABEL
{
. procname = " cipso_cache_enable " ,
. data = & cipso_v4_cache_enabled ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
{
. procname = " cipso_cache_bucket_size " ,
. data = & cipso_v4_cache_bucketsize ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
{
. procname = " cipso_rbm_optfmt " ,
. data = & cipso_v4_rbm_optfmt ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
{
. procname = " cipso_rbm_strictvalid " ,
. data = & cipso_v4_rbm_strictvalid ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec ,
2006-08-03 16:48:06 -07:00
} ,
# endif /* CONFIG_NETLABEL */
2006-11-09 16:32:06 -08:00
{
. procname = " tcp_available_congestion_control " ,
. maxlen = TCP_CA_BUF_MAX ,
. mode = 0444 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_tcp_available_congestion_control ,
2006-11-09 16:32:06 -08:00
} ,
2006-11-09 16:35:15 -08:00
{
. procname = " tcp_allowed_congestion_control " ,
. maxlen = TCP_CA_BUF_MAX ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_allowed_congestion_control ,
2006-11-09 16:35:15 -08:00
} ,
2010-02-18 02:47:01 +00:00
{
. procname = " tcp_thin_linear_timeouts " ,
. data = & sysctl_tcp_thin_linear_timeouts ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2013-12-23 14:37:31 +08:00
{
2010-02-18 04:48:19 +00:00
. procname = " tcp_thin_dupack " ,
. data = & sysctl_tcp_thin_dupack ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2012-05-02 13:30:03 +00:00
{
. procname = " tcp_early_retrans " ,
. data = & sysctl_tcp_early_retrans ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & zero ,
tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-11 10:00:43 +00:00
. extra2 = & four ,
2012-05-02 13:30:03 +00:00
} ,
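The probe timeout rules from the Tail loss probe commit message above can be sketched in plain C. This is a minimal userspace illustration, not the kernel implementation: the function and helper names are hypothetical, times are in whole milliseconds, and SRTT is assumed to stand in for RTT in the single-packet case.

```c
#include <stdint.h>

/* Small helpers; the names are illustrative, not kernel APIs. */
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Probe timeout (PTO) in milliseconds, following the rules listed in the
 * commit message. Assumption: SRTT is used wherever the message says RTT.
 */
uint32_t tlp_probe_timeout_ms(uint32_t srtt_ms, uint32_t rto_ms,
                              uint32_t packets_out)
{
    uint32_t pto_ms;

    if (packets_out == 1)
        /* max(2*RTT, 1.5*RTT + 200ms): leaves room for a delayed ACK. */
        pto_ms = max_u32(2 * srtt_ms, srtt_ms + srtt_ms / 2 + 200);
    else
        /* max(2*SRTT, 10ms) */
        pto_ms = max_u32(2 * srtt_ms, 10);

    /* PTO = min(PTO, RTO): the probe never fires later than the RTO. */
    return min_u32(pto_ms, rto_ms);
}
```

With a 50 ms SRTT and a 1000 ms RTO, several packets in flight give a 100 ms probe timeout, while a single outstanding packet gives 275 ms because of the delayed-ACK allowance.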
tcp: TSO packets automatic sizing
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27 05:46:32 -07:00
{
. procname = " tcp_min_tso_segs " ,
. data = & sysctl_tcp_min_tso_segs ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
2015-05-26 08:55:28 -07:00
. extra1 = & one ,
2013-08-27 05:46:32 -07:00
. extra2 = & gso_max_segs ,
} ,
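The pacing formula quoted in the TSO auto-sizing commit message, sk_pacing_rate = 2 * cwnd * mss / srtt, can be worked through in a small standalone sketch. The function names, the microsecond unit for srtt, and the zero-srtt guard are assumptions made for illustration; the kernel's own arithmetic and clamping may differ.

```c
#include <stdint.h>

/*
 * Pacing rate in bytes per second: twice the current sending rate,
 * which leaves headroom for slow-start ramp up.
 * Assumptions: srtt is given in microseconds, and a zero srtt means
 * "no estimate yet" and yields a rate of 0. Inputs are expected to be
 * realistic TCP values, so the 64-bit product does not overflow.
 */
uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
                                   uint32_t srtt_us)
{
    if (srtt_us == 0)
        return 0;
    return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/*
 * Number of MSS-sized segments to put in one TSO packet so that roughly
 * one packet is sent per millisecond, never going below min_tso_segs.
 */
uint32_t tso_segs_goal(uint64_t rate_bytes_per_sec, uint32_t mss,
                       uint32_t min_tso_segs)
{
    uint64_t segs;

    if (mss == 0)
        return min_tso_segs;
    segs = rate_bytes_per_sec / 1000 / mss;   /* bytes per ms, in segments */
    return segs > min_tso_segs ? (uint32_t)segs : min_tso_segs;
}
```

A slow flow (cwnd 10, MSS 1448, srtt 100 ms) paces at 289600 bytes per second and falls back to the minimum of 2 segments, whereas a fast flow (cwnd 1000, srtt 1 ms) is allowed large TSO packets.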
tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detect that an skb was underused
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in terms of CPU usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-05 22:36:05 -08:00
{
. procname = " tcp_autocorking " ,
. data = & sysctl_tcp_autocorking ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & zero ,
. extra2 = & one ,
} ,
tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks
Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.
This patch includes:
- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending
The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.
We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to elicit a dupack in
response.
We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.
The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing
knob for ICMP, icmp_ratelimit.
The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 16:04:38 -05:00
{
. procname = " tcp_invalid_ratelimit " ,
. data = & sysctl_tcp_invalid_ratelimit ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_ms_jiffies ,
} ,
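The rate-limiting rule described in the tcp_invalid_ratelimit commit message can be illustrated with a small userspace sketch. The struct and function names are hypothetical, time is a plain millisecond counter rather than jiffies, and the bookkeeping is simplified to one timestamp per connection.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-connection state; a hypothetical stand-in for kernel bookkeeping. */
struct oow_limiter {
    uint32_t last_dupack_ms;  /* when the last rate-limited dupack was sent */
    bool     has_last;        /* false until the first dupack is sent */
};

/*
 * Decide whether to answer an out-of-window segment with a dupack.
 * Segments carrying data are always answered (they may be spurious
 * retransmissions wanting a DSACK). SYNs and pure ACKs are answered at
 * most once per ratelimit_ms.
 */
bool oow_dupack_allowed(struct oow_limiter *lim, uint32_t now_ms,
                        uint32_t ratelimit_ms, bool has_data)
{
    if (has_data)
        return true;

    /* Unsigned subtraction keeps the comparison valid across wraparound. */
    if (lim->has_last &&
        (uint32_t)(now_ms - lim->last_dupack_ms) < ratelimit_ms)
        return false;

    lim->last_dupack_ms = now_ms;
    lim->has_last = true;
    return true;
}
```

With the default of 500 ms, a pure ACK arriving 200 ms after the previous dupack is dropped silently, while one arriving 500 ms later is answered again.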
2014-09-19 07:38:40 -07:00
{
. procname = " icmp_msgs_per_sec " ,
. data = & sysctl_icmp_msgs_per_sec ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & zero ,
} ,
{
. procname = " icmp_msgs_burst " ,
. data = & sysctl_icmp_msgs_burst ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec_minmax ,
. extra1 = & zero ,
} ,
2007-12-31 00:29:24 -08:00
{
. procname = " udp_mem " ,
. data = & sysctl_udp_mem ,
. maxlen = sizeof ( sysctl_udp_mem ) ,
. mode = 0644 ,
2010-11-09 23:24:26 +00:00
. proc_handler = proc_doulongvec_minmax ,
2007-12-31 00:29:24 -08:00
} ,
{
. procname = " udp_rmem_min " ,
. data = & sysctl_udp_rmem_min ,
. maxlen = sizeof ( sysctl_udp_rmem_min ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_minmax ,
2015-05-27 22:16:49 +03:00
. extra1 = & min_rcvbuf ,
2007-12-31 00:29:24 -08:00
} ,
{
. procname = " udp_wmem_min " ,
. data = & sysctl_udp_wmem_min ,
. maxlen = sizeof ( sysctl_udp_wmem_min ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_minmax ,
2015-05-27 22:16:49 +03:00
. extra1 = & min_sndbuf ,
2007-12-31 00:29:24 -08:00
} ,
2009-11-05 13:32:03 -08:00
{ }
2005-04-16 15:20:36 -07:00
} ;
2007-12-05 01:41:26 -08:00
2008-03-26 01:56:24 -07:00
static struct ctl_table ipv4_net_table [ ] = {
{
. procname = " icmp_echo_ignore_all " ,
. data = & init_net . ipv4 . sysctl_icmp_echo_ignore_all ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_echo_ignore_broadcasts " ,
. data = & init_net . ipv4 . sysctl_icmp_echo_ignore_broadcasts ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_ignore_bogus_error_responses " ,
. data = & init_net . ipv4 . sysctl_icmp_ignore_bogus_error_responses ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_errors_use_inbound_ifaddr " ,
. data = & init_net . ipv4 . sysctl_icmp_errors_use_inbound_ifaddr ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_ratelimit " ,
. data = & init_net . ipv4 . sysctl_icmp_ratelimit ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec_ms_jiffies ,
2008-03-26 01:56:24 -07:00
} ,
{
. procname = " icmp_ratemask " ,
. data = & init_net . ipv4 . sysctl_icmp_ratemask ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
2008-11-03 18:21:05 -08:00
. proc_handler = proc_dointvec
2008-03-26 01:56:24 -07:00
} ,
net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).
Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/
A new ping socket is created with
socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.
ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.
ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.
The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).
Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping
For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/
Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.
All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.
PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.
PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().
PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.
RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 10:01:00 +00:00
{
. procname = " ping_group_range " ,
2014-05-06 11:02:50 -07:00
. data = & init_net . ipv4 . ping_group_range . range ,
2012-05-24 10:34:21 -06:00
. maxlen = sizeof ( gid_t ) * 2 ,
2011-05-13 10:01:00 +00:00
. mode = 0644 ,
. proc_handler = ipv4_ping_group_range ,
} ,
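The ping_group_range semantics described in the commit message (an inclusive group range, with the default "1 0" matching nobody) reduce to a simple interval test. The sketch below is a simplification under stated assumptions: the function name is hypothetical, and it checks a single group ID, whereas a real check would also have to consider a process's supplementary groups.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if gid lies inside the inclusive range [low, high].
 * A range with low > high (such as the default "1 0") is empty, so no
 * group, not even group 0, matches it.
 */
bool ping_group_allowed(uint32_t gid, uint32_t low, uint32_t high)
{
    return low <= gid && gid <= high;
}
```

For example, the range "100 100" admits only group 100, and "0 4294967295" admits every group.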
2013-01-05 16:10:48 +00:00
{
. procname = " tcp_ecn " ,
. data = & init_net . ipv4 . sysctl_tcp_ecn ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
tcp: add rfc3168, section 6.1.1.1. fallback
This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 21:04:22 +02:00
{
. procname = " tcp_ecn_fallback " ,
. data = & init_net . ipv4 . sysctl_tcp_ecn_fallback ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
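The RFC 3168 section 6.1.1.1 fallback described in the commit message above amounts to choosing the SYN flags based on whether the SYN is a retransmission. The sketch below is illustrative only: the function name and parameters are hypothetical, and only the TCP header flag bit values are standard.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (standard values). */
#define TCP_FLAG_SYN 0x02
#define TCP_FLAG_ECE 0x40
#define TCP_FLAG_CWR 0x80

/*
 * Flags for an outgoing SYN.
 * want_ecn:         the client requests ECN (tcp_ecn=1 in the message).
 * fallback_enabled: models the tcp_ecn_fallback knob.
 * retransmits:      number of SYN timeouts so far for this connection.
 */
uint8_t syn_flags(bool want_ecn, bool fallback_enabled,
                  unsigned int retransmits)
{
    uint8_t flags = TCP_FLAG_SYN;

    /* An ECN-setup SYN carries both ECE and CWR. */
    if (want_ecn)
        flags |= TCP_FLAG_ECE | TCP_FLAG_CWR;

    /* After a timeout, retry as a plain SYN if the fallback is on. */
    if (want_ecn && fallback_enabled && retransmits > 0)
        flags &= (uint8_t)~(TCP_FLAG_ECE | TCP_FLAG_CWR);

    return flags;
}
```

With the fallback disabled, every retransmitted SYN keeps ECE and CWR set, which is the repeated-drop scenario shown in the commit message.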
2013-09-28 14:10:59 -07:00
{
. procname = " ip_local_port_range " ,
2014-05-06 11:02:49 -07:00
. maxlen = sizeof ( init_net . ipv4 . ip_local_ports . range ) ,
. data = & init_net . ipv4 . ip_local_ports . range ,
2013-09-28 14:10:59 -07:00
. mode = 0644 ,
. proc_handler = ipv4_local_port_range ,
} ,
2014-05-12 16:04:53 -07:00
{
. procname = " ip_local_reserved_ports " ,
. data = & init_net . ipv4 . sysctl_local_reserved_ports ,
. maxlen = 65536 ,
. mode = 0644 ,
. proc_handler = proc_do_large_bitmap ,
} ,
2013-12-14 05:13:38 +01:00
{
. procname = " ip_no_pmtu_disc " ,
. data = & init_net . ipv4 . sysctl_ip_no_pmtu_disc ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2014-01-09 10:01:15 +01:00
{
. procname = " ip_forward_use_pmtu " ,
. data = & init_net . ipv4 . sysctl_ip_fwd_use_pmtu ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2014-09-05 15:09:03 +02:00
{
. procname = " ip_nonlocal_bind " ,
. data = & init_net . ipv4 . sysctl_ip_nonlocal_bind ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec
} ,
2014-05-13 10:17:33 -07:00
{
. procname = " fwmark_reflect " ,
. data = & init_net . ipv4 . sysctl_fwmark_reflect ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
net: support marking accepting TCP sockets
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.
This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.
This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established. If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
Black-box tested using user-mode linux:
- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 10:17:35 -07:00
{
. procname = " tcp_fwmark_accept " ,
. data = & init_net . ipv4 . sysctl_tcp_fwmark_accept ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2015-02-10 09:53:16 +08:00
{
. procname = " tcp_mtu_probing " ,
. data = & init_net . ipv4 . sysctl_tcp_mtu_probing ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
{
. procname = " tcp_base_mss " ,
. data = & init_net . ipv4 . sysctl_tcp_base_mss ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2015-03-06 11:18:23 +08:00
{
. procname = " tcp_probe_threshold " ,
. data = & init_net . ipv4 . sysctl_tcp_probe_threshold ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2015-03-06 11:18:24 +08:00
{
. procname = " tcp_probe_interval " ,
. data = & init_net . ipv4 . sysctl_tcp_probe_interval ,
. maxlen = sizeof ( int ) ,
. mode = 0644 ,
. proc_handler = proc_dointvec ,
} ,
2008-03-26 01:56:24 -07:00
{ }
} ;
2008-03-26 01:54:18 -07:00
static __net_init int ipv4_sysctl_init_net ( struct net * net )
{
2008-03-26 01:56:24 -07:00
struct ctl_table * table ;
table = ipv4_net_table ;
2009-11-25 15:14:13 -08:00
if ( ! net_eq ( net , & init_net ) ) {
2013-10-19 16:27:03 -07:00
int i ;
2008-03-26 01:56:24 -07:00
table = kmemdup ( table , sizeof ( ipv4_net_table ) , GFP_KERNEL ) ;
2015-04-03 09:17:26 +01:00
if ( ! table )
2008-03-26 01:56:24 -07:00
goto err_alloc ;
2013-10-19 16:27:03 -07:00
/* Update the variables to point into the current struct net */
for ( i = 0 ; i < ARRAY_SIZE ( ipv4_net_table ) - 1 ; i + + )
table [ i ] . data + = ( void * ) net - ( void * ) & init_net ;
2008-03-26 01:56:24 -07:00
}
2012-04-19 13:44:49 +00:00
net - > ipv4 . ipv4_hdr = register_net_sysctl ( net , " net/ipv4 " , table ) ;
2015-04-03 09:17:26 +01:00
if ( ! net - > ipv4 . ipv4_hdr )
2008-03-26 01:56:24 -07:00
goto err_reg ;
2014-05-12 16:04:53 -07:00
net - > ipv4 . sysctl_local_reserved_ports = kzalloc ( 65536 / 8 , GFP_KERNEL ) ;
if ( ! net - > ipv4 . sysctl_local_reserved_ports )
goto err_ports ;
2008-03-26 01:54:18 -07:00
return 0 ;
2008-03-26 01:56:24 -07:00
2014-05-12 16:04:53 -07:00
err_ports :
unregister_net_sysctl_table ( net - > ipv4 . ipv4_hdr ) ;
2008-03-26 01:56:24 -07:00
err_reg :
2009-11-25 15:14:13 -08:00
if ( ! net_eq ( net , & init_net ) )
2008-03-26 01:56:24 -07:00
kfree ( table ) ;
err_alloc :
return - ENOMEM ;
2008-03-26 01:54:18 -07:00
}
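The pointer rebasing done in ipv4_sysctl_init_net() (copy the template table, then shift each .data pointer by the distance between the new namespace and init_net) can be shown in a userspace sketch. All types and names here are hypothetical stand-ins, malloc/memcpy replace kmemdup, and the offset is computed with char pointers to stay within standard C.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-namespace settings, standing in for struct net. */
struct ns {
    int ecn;
    int mtu_probing;
};

/* Hypothetical table entry, standing in for struct ctl_table. */
struct entry {
    const char *name;
    void       *data;
};

static struct ns init_ns = { 2, 0 };

/* Template table: every data pointer refers to a field of init_ns. */
static struct entry template_table[] = {
    { "ecn",         &init_ns.ecn },
    { "mtu_probing", &init_ns.mtu_probing },
    { NULL, NULL }                      /* terminator, like "{ }" */
};

/*
 * Duplicate the template and make each data pointer refer to the same
 * field inside *ns. Returns NULL on allocation failure; the caller
 * frees the result.
 */
struct entry *clone_table_for(struct ns *ns)
{
    size_t i, n = sizeof(template_table) / sizeof(template_table[0]);
    struct entry *t = malloc(sizeof(template_table));

    if (!t)
        return NULL;
    memcpy(t, template_table, sizeof(template_table));

    /* Skip the terminator, as the "ARRAY_SIZE(...) - 1" loop bound does. */
    for (i = 0; i < n - 1; i++) {
        ptrdiff_t off = (char *)t[i].data - (char *)&init_ns;
        t[i].data = (char *)ns + off;
    }
    return t;
}
```

The design relies on every entry in the template pointing inside the initial instance; an entry pointing at an unrelated global would be shifted to a meaningless address.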
static __net_exit void ipv4_sysctl_exit_net ( struct net * net )
{
2008-03-26 01:56:24 -07:00
struct ctl_table * table ;
2014-05-12 16:04:53 -07:00
kfree ( net - > ipv4 . sysctl_local_reserved_ports ) ;
2008-03-26 01:56:24 -07:00
table = net - > ipv4 . ipv4_hdr - > ctl_table_arg ;
unregister_net_sysctl_table ( net - > ipv4 . ipv4_hdr ) ;
kfree ( table ) ;
2008-03-26 01:54:18 -07:00
}
static __net_initdata struct pernet_operations ipv4_sysctl_ops = {
. init = ipv4_sysctl_init_net ,
. exit = ipv4_sysctl_exit_net ,
} ;
2007-12-05 01:41:26 -08:00
static __init int sysctl_ipv4_init ( void )
{
struct ctl_table_header * hdr ;
2012-04-19 13:44:49 +00:00
hdr = register_net_sysctl ( & init_net , " net/ipv4 " , ipv4_table ) ;
2015-04-03 09:17:26 +01:00
if ( ! hdr )
2008-03-26 01:54:18 -07:00
return - ENOMEM ;
if ( register_pernet_subsys ( & ipv4_sysctl_ops ) ) {
2012-04-19 13:24:33 +00:00
unregister_net_sysctl_table ( hdr ) ;
2008-03-26 01:54:18 -07:00
return - ENOMEM ;
}
return 0 ;
2007-12-05 01:41:26 -08:00
}
__initcall ( sysctl_ipv4_init ) ;