// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Implementation of the Transmission Control Protocol (TCP).
 *
 *		IPv4 specific functions
 *
 *		code split from:
 *		linux/ipv4/tcp.c
 *		linux/ipv4/tcp_input.c
 *		linux/ipv4/tcp_output.c
 *
 *		See tcp.c for author information
 */
/*
 * Changes:
 *		David S. Miller	:	New socket lookup architecture.
 *					This code is dedicated to John Dyson.
 *		David S. Miller	:	Change semantics of established hash,
 *					half is devoted to TIME_WAIT sockets
 *					and the rest go in the other half.
 *		Andi Kleen	:	Add support for syncookies and fixed
 *					some bugs: ip options weren't passed to
 *					the TCP layer, missed a check for an
 *					ACK bit.
 *		Andi Kleen	:	Implemented fast path mtu discovery.
 *					Fixed many serious bugs in the
 *					request_sock handling and moved
 *					most of it into the af independent code.
 *					Added tail drop and some other bugfixes.
 *					Added new listen semantics.
 *		Mike McLagan	:	Routing by source
 *	Juan Jose Ciarlante	:	ip_dynaddr bits
 *		Andi Kleen	:	various fixes.
 *	Vitaly E. Lavrov	:	Transparent proxy revived after year
 *					coma.
 *	Andi Kleen		:	Fix new listen.
 *	Andi Kleen		:	Fix accept error reporting.
 *	YOSHIFUJI Hideaki @USAGI and:	Support IPV6_V6ONLY socket option, which
 *	Alexey Kuznetsov		allow both IPv4 and IPv6 sockets to bind
 *					a single port at the same time.
 */

#define pr_fmt(fmt) "TCP: " fmt

#include <linux/bottom_half.h>
#include <linux/types.h>
#include <linux/fcntl.h>
#include <linux/module.h>
#include <linux/random.h>
#include <linux/cache.h>
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
#include <linux/slab.h>

#include <net/net_namespace.h>
#include <net/icmp.h>
#include <net/inet_hashtables.h>
#include <net/tcp.h>
#include <net/transp_v6.h>
#include <net/ipv6.h>
#include <net/inet_common.h>
#include <net/timewait_sock.h>
#include <net/xfrm.h>
#include <net/secure_seq.h>
#include <net/busy_poll.h>

#include <linux/inet.h>
#include <linux/ipv6.h>
#include <linux/stddef.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/inetdevice.h>
#include <linux/btf_ids.h>

#include <crypto/hash.h>
#include <linux/scatterlist.h>

#include <trace/events/tcp.h>

#ifdef CONFIG_TCP_MD5SIG
static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
			       __be32 daddr, __be32 saddr, const struct tcphdr *th);
#endif

struct inet_hashinfo tcp_hashinfo;
EXPORT_SYMBOL(tcp_hashinfo);

static DEFINE_PER_CPU(struct sock *, ipv4_tcp_sk);

static u32 tcp_v4_init_seq(const struct sk_buff *skb)
{
	return secure_tcp_seq(ip_hdr(skb)->daddr,
			      ip_hdr(skb)->saddr,
			      tcp_hdr(skb)->dest,
			      tcp_hdr(skb)->source);
}

static u32 tcp_v4_init_ts_off(const struct net *net, const struct sk_buff *skb)
{
	return secure_tcp_ts_off(net, ip_hdr(skb)->daddr, ip_hdr(skb)->saddr);
}

int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
{
	int reuse = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tw_reuse);
	const struct inet_timewait_sock *tw = inet_twsk(sktw);
	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
	struct tcp_sock *tp = tcp_sk(sk);

	if (reuse == 2) {
		/* Still does not detect *everything* that goes through
		 * lo, since we require a loopback src or dst address
		 * or direct binding to 'lo' interface.
		 */
		bool loopback = false;
		if (tw->tw_bound_dev_if == LOOPBACK_IFINDEX)
			loopback = true;
#if IS_ENABLED(CONFIG_IPV6)
		if (tw->tw_family == AF_INET6) {
			if (ipv6_addr_loopback(&tw->tw_v6_daddr) ||
			    ipv6_addr_v4mapped_loopback(&tw->tw_v6_daddr) ||
			    ipv6_addr_loopback(&tw->tw_v6_rcv_saddr) ||
			    ipv6_addr_v4mapped_loopback(&tw->tw_v6_rcv_saddr))
				loopback = true;
		} else
#endif
		{
			if (ipv4_is_loopback(tw->tw_daddr) ||
			    ipv4_is_loopback(tw->tw_rcv_saddr))
				loopback = true;
		}
		if (!loopback)
			reuse = 0;
	}

	/* With PAWS, it is safe from the viewpoint
	   of data integrity. Even without PAWS it is safe provided sequence
	   spaces do not overlap i.e. at data rates <= 80Mbit/sec.

	   Actually, the idea is close to VJ's one, only timestamp cache is
	   held not per host, but per port pair and TW bucket is used as state
	   holder.

	   If TW bucket has been already destroyed we fall back to VJ's scheme
	   and use initial timestamp retrieved from peer table.
	 */
	if (tcptw->tw_ts_recent_stamp &&
	    (!twp || (reuse && time_after32(ktime_get_seconds(),
					    tcptw->tw_ts_recent_stamp)))) {
		/* In case of repair and re-using TIME-WAIT sockets we still
		 * want to be sure that it is safe as above but honor the
		 * sequence numbers and time stamps set as part of the repair
		 * process.
		 *
		 * Without this check re-using a TIME-WAIT socket with TCP
		 * repair would accumulate a -1 on the repair assigned
		 * sequence number. The first time it is reused the sequence
		 * is -1, the second time -2, etc. This fixes that issue
		 * without appearing to create any others.
		 */
		if (likely(!tp->repair)) {
			u32 seq = tcptw->tw_snd_nxt + 65535 + 2;

			if (!seq)
				seq = 1;
			WRITE_ONCE(tp->write_seq, seq);
			tp->rx_opt.ts_recent	   = tcptw->tw_ts_recent;
			tp->rx_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
		}
		sock_hold(sktw);
		return 1;
	}

	return 0;
}
EXPORT_SYMBOL_GPL(tcp_twsk_unique);
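
The sequence-number arithmetic used when a TIME-WAIT socket is reused can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code: `twsk_next_write_seq` and `twsk_seq_examples` are hypothetical names introduced here, and `uint32_t` stands in for the kernel's `u32`. It illustrates the unsigned wraparound in `tw_snd_nxt + 65535 + 2` and the replacement of a zero result by 1, which matters because `tcp_v4_connect()` in this file treats a zero `write_seq` as "not yet chosen".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace helper mirroring the arithmetic in
 * tcp_twsk_unique(): step past the old connection's send sequence by
 * 65535 + 2, relying on well-defined unsigned 32-bit wraparound, and
 * never return 0.
 */
static uint32_t twsk_next_write_seq(uint32_t tw_snd_nxt)
{
	uint32_t seq = tw_snd_nxt + 65535u + 2u;

	if (seq == 0)
		seq = 1;
	return seq;
}

/* Usage example covering the plain, wrapping, and zero-result cases. */
static void twsk_seq_examples(void)
{
	/* No wraparound: 0 + 65537. */
	assert(twsk_next_write_seq(0u) == 65537u);
	/* Wraps past 2^32: 0xFFFF0000 + 0x10001 = 0x100000001, truncated to 1. */
	assert(twsk_next_write_seq(0xFFFF0000u) == 1u);
	/* Sum is exactly 2^32, i.e. 0 after truncation, so 1 is used instead. */
	assert(twsk_next_write_seq(0xFFFEFFFFu) == 1u);
}
```

Calling `twsk_seq_examples()` from a test program exercises all three cases.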

static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
			      int addr_len)
{
	/* This check is replicated from tcp_v4_connect() and intended to
	 * prevent BPF program called below from accessing bytes that are out
	 * of the bound specified by user in addr_len.
	 */
	if (addr_len < sizeof(struct sockaddr_in))
		return -EINVAL;

	sock_owned_by_me(sk);

	return BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr);
}

/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
	struct inet_timewait_death_row *tcp_death_row;
	struct inet_sock *inet = inet_sk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	struct ip_options_rcu *inet_opt;
	struct net *net = sock_net(sk);
	__be16 orig_sport, orig_dport;
	__be32 daddr, nexthop;
	struct flowi4 *fl4;
	struct rtable *rt;
	int err;

	if (addr_len < sizeof(struct sockaddr_in))
		return -EINVAL;

	if (usin->sin_family != AF_INET)
		return -EAFNOSUPPORT;

	nexthop = daddr = usin->sin_addr.s_addr;
	inet_opt = rcu_dereference_protected(inet->inet_opt,
					     lockdep_sock_is_held(sk));
	if (inet_opt && inet_opt->opt.srr) {
		if (!daddr)
			return -EINVAL;
		nexthop = inet_opt->opt.faddr;
	}

	orig_sport = inet->inet_sport;
	orig_dport = usin->sin_port;
	fl4 = &inet->cork.fl.u.ip4;
	rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
			      sk->sk_bound_dev_if, IPPROTO_TCP, orig_sport,
			      orig_dport, sk);
	if (IS_ERR(rt)) {
		err = PTR_ERR(rt);
		if (err == -ENETUNREACH)
			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
		return err;
	}

	if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
		ip_rt_put(rt);
		return -ENETUNREACH;
	}

	if (!inet_opt || !inet_opt->opt.srr)
		daddr = fl4->daddr;

	tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;

	if (!inet->inet_saddr) {
		err = inet_bhash2_update_saddr(sk, &fl4->saddr, AF_INET);
		if (err) {
			ip_rt_put(rt);
			return err;
		}
	} else {
		sk_rcv_saddr_set(sk, inet->inet_saddr);
	}

	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
		/* Reset inherited state */
		tp->rx_opt.ts_recent	   = 0;
		tp->rx_opt.ts_recent_stamp = 0;
		if (likely(!tp->repair))
			WRITE_ONCE(tp->write_seq, 0);
	}

	inet->inet_dport = usin->sin_port;
	sk_daddr_set(sk, daddr);

	inet_csk(sk)->icsk_ext_hdr_len = 0;
	if (inet_opt)
		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;

	tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;

	/* Socket identity is still unknown (sport may be zero).
	 * However we set state to SYN-SENT and not releasing socket
	 * lock select source port, enter ourselves into the hash tables and
	 * complete initialization after this.
	 */
	tcp_set_state(sk, TCP_SYN_SENT);
	err = inet_hash_connect(tcp_death_row, sk);
	if (err)
		goto failure;

	sk_set_txhash(sk);

	rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
			       inet->inet_sport, inet->inet_dport, sk);
	if (IS_ERR(rt)) {
		err = PTR_ERR(rt);
		rt = NULL;
		goto failure;
	}
	/* OK, now commit destination to socket. */
	sk->sk_gso_type = SKB_GSO_TCPV4;
	sk_setup_caps(sk, &rt->dst);
	rt = NULL;

	if (likely(!tp->repair)) {
		if (!tp->write_seq)
			WRITE_ONCE(tp->write_seq,
				   secure_tcp_seq(inet->inet_saddr,
						  inet->inet_daddr,
						  inet->inet_sport,
						  usin->sin_port));
		tp->tsoffset = secure_tcp_ts_off(net, inet->inet_saddr,
						 inet->inet_daddr);
	}

	inet->inet_id = get_random_u16();

	if (tcp_fastopen_defer_connect(sk, &err))
		return err;
	if (err)
		goto failure;

	err = tcp_connect(sk);

	if (err)
		goto failure;

	return 0;

failure:
	/*
	 * This unhashes the socket and releases the local port,
	 * if necessary.
	 */
	tcp_set_state(sk, TCP_CLOSE);
	inet_bhash2_reset_saddr(sk);
	ip_rt_put(rt);
	sk->sk_route_caps = 0;
	inet->inet_dport = 0;
	return err;
}
EXPORT_SYMBOL(tcp_v4_connect);

/*
 * This routine reacts to ICMP_FRAG_NEEDED mtu indications as defined in RFC1191.
 * It can be called through tcp_release_cb() if socket was owned by user
 * at the time tcp_v4_err() was called to handle ICMP message.
 */
void tcp_v4_mtu_reduced(struct sock *sk)
{
	struct inet_sock *inet = inet_sk(sk);
	struct dst_entry *dst;
	u32 mtu;

	if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
		return;
	mtu = READ_ONCE(tcp_sk(sk)->mtu_info);
	dst = inet_csk_update_pmtu(sk, mtu);
	if (!dst)
		return;

	/* Something is about to be wrong... Remember soft error
	 * for the case, if this connection will not be able to recover.
	 */
	if (mtu < dst_mtu(dst) && ip_dont_fragment(sk, dst))
		WRITE_ONCE(sk->sk_err_soft, EMSGSIZE);

	mtu = dst_mtu(dst);

	if (inet->pmtudisc != IP_PMTUDISC_DONT &&
	    ip_sk_accept_pmtu(sk) &&
	    inet_csk(sk)->icsk_pmtu_cookie > mtu) {
		tcp_sync_mss(sk, mtu);

		/* Resend the TCP packet because it's
		 * clear that the old packet has been
		 * dropped. This is the new "fast" path mtu
		 * discovery.
		 */
		tcp_simple_retransmit(sk);
	} /* else let the usual retransmit timer handle it */
}
EXPORT_SYMBOL(tcp_v4_mtu_reduced);

static void do_redirect(struct sk_buff *skb, struct sock *sk)
{
	struct dst_entry *dst = __sk_dst_check(sk, 0);

	if (dst)
		dst->ops->redirect(dst, sk, skb);
}

/* handle ICMP messages on TCP_NEW_SYN_RECV request sockets */
void tcp_req_err(struct sock *sk, u32 seq, bool abort)
{
	struct request_sock *req = inet_reqsk(sk);
	struct net *net = sock_net(sk);

	/* ICMPs are not backlogged, hence we cannot get
	 * an established socket here.
	 */
	if (seq != tcp_rsk(req)->snt_isn) {
		__NET_INC_STATS(net, LINUX_MIB_OUTOFWINDOWICMPS);
	} else if (abort) {
		/*
		 * Still in SYN_RECV, just remove it silently.
		 * There is no good way to pass the error to the newly
		 * created socket, and POSIX does not want network
		 * errors returned from accept().
		 */
		inet_csk_reqsk_queue_drop(req->rsk_listener, req);
		tcp_listendrop(req->rsk_listener);
	}
	reqsk_put(req);
}
EXPORT_SYMBOL(tcp_req_err);
2020-05-26 19:48:49 -07:00
/* TCP-LD (RFC 6069) logic */
2020-05-27 17:34:58 -07:00
void tcp_ld_RTO_revert(struct sock *sk, u32 seq)
2020-05-26 19:48:49 -07:00
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;
	s32 remaining;
	u32 delta_us;
	if (sock_owned_by_user(sk))
		return;
	if (seq != tp->snd_una || !icsk->icsk_retransmits ||
	    !icsk->icsk_backoff)
		return;
	skb = tcp_rtx_queue_head(sk);
	if (WARN_ON_ONCE(!skb))
		return;
	icsk->icsk_backoff--;
	icsk->icsk_rto = tp->srtt_us ? __tcp_set_rto(tp) : TCP_TIMEOUT_INIT;
	icsk->icsk_rto = inet_csk_rto_backoff(icsk, TCP_RTO_MAX);
	tcp_mstamp_refresh(tp);
	delta_us = (u32)(tp->tcp_mstamp - tcp_skb_timestamp_us(skb));
	remaining = icsk->icsk_rto - usecs_to_jiffies(delta_us);
	if (remaining > 0) {
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
					  remaining, TCP_RTO_MAX);
	} else {
		/* RTO revert clocked out retransmission.
		 * Will retransmit now.
		 */
		tcp_retransmit_timer(sk);
	}
}
2020-05-27 17:34:58 -07:00
EXPORT_SYMBOL(tcp_ld_RTO_revert);
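The arithmetic in tcp_ld_RTO_revert() can be followed in isolation: undo one exponential backoff step, recompute the backed-off timeout, and subtract the time already spent waiting. The sketch below is a simplified userspace model in milliseconds. The names rto_state, backed_off_rto, rto_revert and the 120000 ms cap are illustrative assumptions, and the model assumes a backoff count below 32.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RTO_MAX_MS 120000u	/* assumed cap, for illustration only */

struct rto_state {
	uint32_t base_rto_ms;	/* timeout before any backoff */
	uint8_t backoff;	/* number of doublings applied (assumed < 32) */
};

/* Timeout after exponential backoff, clamped to the cap. */
static uint32_t backed_off_rto(const struct rto_state *s)
{
	uint64_t rto = (uint64_t)s->base_rto_ms << s->backoff;

	return rto > SKETCH_RTO_MAX_MS ? SKETCH_RTO_MAX_MS : (uint32_t)rto;
}

/*
 * Undo one backoff step. On success, *remaining_ms holds the time left
 * before the timer should fire; a value <= 0 means "retransmit now".
 * Returns false when there is no backoff to revert.
 */
static bool rto_revert(struct rto_state *s, uint32_t elapsed_ms,
		       int64_t *remaining_ms)
{
	if (s->backoff == 0)
		return false;
	s->backoff--;
	*remaining_ms = (int64_t)backed_off_rto(s) - (int64_t)elapsed_ms;
	return true;
}
```

With a 200 ms base timeout and three doublings, one revert shortens the timeout from 1600 ms to 800 ms, so a segment that has already waited 500 ms has 300 ms left.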
2020-05-26 19:48:49 -07:00
2005-04-16 15:20:36 -07:00
/*
 * This routine is called by the ICMP module when it gets some
 * sort of error condition. If err < 0 then the socket should
 * be closed and the error returned to the user. If err > 0
 * it's just the icmp type << 8 | icmp code. After adjustment
 * header points to the first 8 bytes of the tcp header. We need
 * to find the appropriate port.
 *
 * The locking strategy used here is very "optimistic". When
 * someone else accesses the socket the ICMP is just dropped
 * and for some paths there is no check at all.
 * A more general error queue to queue errors for later handling
 * is probably better.
 *
 */
2020-05-26 19:48:50 -07:00
int tcp_v4_err(struct sk_buff *skb, u32 info)
2005-04-16 15:20:36 -07:00
{
2020-05-26 19:48:50 -07:00
	const struct iphdr *iph = (const struct iphdr *)skb->data;
	struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2));
2005-04-16 15:20:36 -07:00
	struct tcp_sock *tp;
	struct inet_sock *inet;
2020-05-26 19:48:50 -07:00
	const int type = icmp_hdr(skb)->type;
	const int code = icmp_hdr(skb)->code;
2005-04-16 15:20:36 -07:00
	struct sock *sk;
2014-05-11 20:22:12 -07:00
	struct request_sock *fastopen;
2017-05-16 14:00:14 -07:00
	u32 seq, snd_una;
2005-04-16 15:20:36 -07:00
	int err;
2020-05-26 19:48:50 -07:00
	struct net *net = dev_net(skb->dev);
2005-04-16 15:20:36 -07:00
2022-09-07 18:10:20 -07:00
	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row.hashinfo,
				       iph->daddr, th->dest, iph->saddr,
				       ntohs(th->source), inet_iif(skb), 0);
2005-04-16 15:20:36 -07:00
	if (!sk) {
2016-04-27 16:44:29 -07:00
		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
2018-11-08 12:19:21 +01:00
		return -ENOENT;
2005-04-16 15:20:36 -07:00
	}
	if (sk->sk_state == TCP_TIME_WAIT) {
2006-10-10 19:41:46 -07:00
		inet_twsk_put(inet_twsk(sk));
2018-11-08 12:19:21 +01:00
		return 0;
2005-04-16 15:20:36 -07:00
	}
2015-03-22 10:22:22 -07:00
	seq = ntohl(th->seq);
2018-11-08 12:19:21 +01:00
	if (sk->sk_state == TCP_NEW_SYN_RECV) {
		tcp_req_err(sk, seq, type == ICMP_PARAMETERPROB ||
				     type == ICMP_TIME_EXCEEDED ||
				     (type == ICMP_DEST_UNREACH &&
				      (code == ICMP_NET_UNREACH ||
				       code == ICMP_HOST_UNREACH)));
		return 0;
	}
2005-04-16 15:20:36 -07:00
	bh_lock_sock(sk);
	/* If too many ICMPs get dropped on busy
	 * servers this needs to be solved differently.
2012-07-23 09:48:52 +02:00
	 * We do take care of PMTU discovery (RFC1191) special case:
	 * we can receive locally generated ICMP messages while socket is held.
2005-04-16 15:20:36 -07:00
	 */
2013-01-19 16:10:37 +00:00
	if (sock_owned_by_user(sk)) {
		if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
2016-04-27 16:44:39 -07:00
			__NET_INC_STATS(net, LINUX_MIB_LOCKDROPPEDICMPS);
2013-01-19 16:10:37 +00:00
	}
2005-04-16 15:20:36 -07:00
	if (sk->sk_state == TCP_CLOSE)
		goto out;
2021-10-25 09:48:24 -07:00
	if (static_branch_unlikely(&ip4_min_ttl)) {
		/* min_ttl can be changed concurrently from do_ip_setsockopt() */
		if (unlikely(iph->ttl < READ_ONCE(inet_sk(sk)->min_ttl))) {
			__NET_INC_STATS(net, LINUX_MIB_TCPMINTTLDROP);
			goto out;
		}
2010-03-18 11:27:32 +00:00
	}
2005-04-16 15:20:36 -07:00
	tp = tcp_sk(sk);
2014-05-11 20:22:12 -07:00
	/* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
2019-10-10 20:17:38 -07:00
	fastopen = rcu_dereference(tp->fastopen_rsk);
2014-05-11 20:22:12 -07:00
	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
2005-04-16 15:20:36 -07:00
	if (sk->sk_state != TCP_LISTEN &&
2014-05-11 20:22:12 -07:00
	    !between(seq, snd_una, tp->snd_nxt)) {
2016-04-27 16:44:39 -07:00
		__NET_INC_STATS(net, LINUX_MIB_OUTOFWINDOWICMPS);
2005-04-16 15:20:36 -07:00
		goto out;
	}
	switch (type) {
2012-07-11 21:27:49 -07:00
	case ICMP_REDIRECT:
2017-03-10 16:40:33 +11:00
		if (!sock_owned_by_user(sk))
2020-05-26 19:48:50 -07:00
			do_redirect(skb, sk);
2012-07-11 21:27:49 -07:00
		goto out;
2005-04-16 15:20:36 -07:00
	case ICMP_SOURCE_QUENCH:
		/* Just silently ignore these. */
		goto out;
	case ICMP_PARAMETERPROB:
		err = EPROTO;
		break;
	case ICMP_DEST_UNREACH:
		if (code > NR_ICMP_UNREACH)
			goto out;
		if (code == ICMP_FRAG_NEEDED) { /* PMTU discovery (RFC1191) */
2013-03-18 07:01:28 +00:00
			/* We are not interested in TCP_LISTEN and open_requests
			 * (SYN-ACKs sent out by Linux are always <576 bytes so
			 * they should go through unfragmented).
			 */
			if (sk->sk_state == TCP_LISTEN)
				goto out;
2021-07-02 13:09:03 -07:00
			WRITE_ONCE(tp->mtu_info, info);
tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added a bug leading to the following trace:
[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]
The bug comes from the fact that timer set in sk_reset_timer() can run
before we actually do the sock_hold(). socket refcount reaches zero and
we free the socket too soon.
timer handler is not allowed to reduce socket refcnt if socket is owned
by the user, or we need to change sk_reset_timer() implementation.
We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bits are set in tsq_flags.
Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.
For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 00:22:46 +00:00
			if (!sock_owned_by_user(sk)) {
2012-07-23 09:48:52 +02:00
				tcp_v4_mtu_reduced(sk);
2012-08-20 00:22:46 +00:00
			} else {
2016-12-03 11:14:57 -08:00
				if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &sk->sk_tsq_flags))
2012-08-20 00:22:46 +00:00
					sock_hold(sk);
			}
2005-04-16 15:20:36 -07:00
			goto out;
		}
		err = icmp_err_convert[code].errno;
2020-05-26 19:48:49 -07:00
		/* check if this ICMP message allows revert of backoff.
		 * (see RFC 6069)
		 */
		if (!fastopen &&
		    (code == ICMP_NET_UNREACH || code == ICMP_HOST_UNREACH))
			tcp_ld_RTO_revert(sk, seq);
2005-04-16 15:20:36 -07:00
		break;
	case ICMP_TIME_EXCEEDED:
		err = EHOSTUNREACH;
		break;
	default:
		goto out;
	}
	switch (sk->sk_state) {
	case TCP_SYN_SENT:
2014-05-11 20:22:12 -07:00
	case TCP_SYN_RECV:
		/* Only in fast or simultaneous open. If a fast open socket is
2020-08-22 16:31:41 -07:00
		 * already accepted it is treated as a connected one below.
2014-05-11 20:22:12 -07:00
		 */
2015-04-03 09:17:26 +01:00
		if (fastopen && !fastopen->sk)
2014-05-11 20:22:12 -07:00
			break;
2020-05-26 19:48:50 -07:00
		ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
tcp: allow traceroute -Mtcp for unpriv users
Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.
$ traceroute -Mtcp 8.8.8.8
You do not have enough privileges to use this traceroute method.
$ traceroute -n -Mudp 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
1 192.168.86.1 3.631 ms 3.512 ms 3.405 ms
2 10.1.10.1 4.183 ms 4.125 ms 4.072 ms
3 96.120.88.125 20.621 ms 19.462 ms 20.553 ms
4 96.110.177.65 24.271 ms 25.351 ms 25.250 ms
5 69.139.199.197 44.492 ms 43.075 ms 44.346 ms
6 68.86.143.93 27.969 ms 25.184 ms 25.092 ms
7 96.112.146.18 25.323 ms 96.112.146.22 25.583 ms 96.112.146.26 24.502 ms
8 72.14.239.204 24.405 ms 74.125.37.224 16.326 ms 17.194 ms
9 209.85.251.9 18.154 ms 209.85.247.55 14.449 ms 209.85.251.9 26.296 ms^C
We can easily support traceroute over TCP, by queueing an error message
into socket error queue.
Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
enable this feature, and that the error message is only queued
while in SYN_SENT state.
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
{cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144
Suggested-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 11:00:02 -07:00
2005-04-16 15:20:36 -07:00
		if (!sock_owned_by_user(sk)) {
2023-03-15 20:57:44 +00:00
			WRITE_ONCE(sk->sk_err, err);
2005-04-16 15:20:36 -07:00
2021-06-27 18:48:21 -04:00
			sk_error_report(sk);
2005-04-16 15:20:36 -07:00
			tcp_done(sk);
		} else {
2023-03-15 20:57:41 +00:00
			WRITE_ONCE(sk->sk_err_soft, err);
2005-04-16 15:20:36 -07:00
		}
		goto out;
	}
	/* If we've already connected we will keep trying
	 * until we time out, or the user gives up.
	 *
	 * rfc1122 4.2.3.9 allows to consider as hard errors
	 * only PROTO_UNREACH and PORT_UNREACH (well, FRAG_FAILED too,
	 * but it is obsoleted by pmtu discovery).
	 *
	 * Note, that in modern internet, where routing is unreliable
	 * and in each dark corner broken firewalls sit, sending random
	 * errors ordered by their masters even these two messages finally lose
	 * their original sense (even Linux sends invalid PORT_UNREACHs)
	 *
	 * Now we are in compliance with RFCs.
	 *							--ANK (980905)
	 */
	inet = inet_sk(sk);
	if (!sock_owned_by_user(sk) && inet->recverr) {
2023-03-15 20:57:44 +00:00
		WRITE_ONCE(sk->sk_err, err);
2021-06-27 18:48:21 -04:00
		sk_error_report(sk);
2005-04-16 15:20:36 -07:00
	} else { /* Only an error on timeout */
2023-03-15 20:57:41 +00:00
		WRITE_ONCE(sk->sk_err_soft, err);
2005-04-16 15:20:36 -07:00
	}
out:
	bh_unlock_sock(sk);
	sock_put(sk);
2018-11-08 12:19:21 +01:00
	return 0;
2005-04-16 15:20:36 -07:00
}
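tcp_v4_err() discards ICMP errors whose quoted sequence number fails between(seq, snd_una, tp->snd_nxt). Sequence numbers wrap at 2^32, so the window test has to use modular arithmetic rather than plain comparisons. The sketch below shows one common way to write such a check; seq_between is a hypothetical name, and the body is an illustration of the idea rather than a copy of the kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when seq lies in the window [low, high] of the 32-bit sequence
 * space, including windows that wrap past 0xffffffff. Both differences
 * are taken modulo 2^32, so only distances from `low` are compared.
 */
static bool seq_between(uint32_t seq, uint32_t low, uint32_t high)
{
	return (uint32_t)(high - low) >= (uint32_t)(seq - low);
}
```

For example, with low = 0xfffffff0 and high = 0x10 the window spans the wrap point, and seq = 2 is accepted because its distance from low (0x12) does not exceed the window length (0x20).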
2013-06-07 05:11:46 +00:00
void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr, __be32 daddr)
2005-04-16 15:20:36 -07:00
{
2007-04-10 21:04:22 -07:00
	struct tcphdr *th = tcp_hdr(skb);
2005-04-16 15:20:36 -07:00
2018-02-19 11:56:52 -08:00
	th->check = ~tcp_v4_check(skb->len, saddr, daddr, 0);
	skb->csum_start = skb_transport_header(skb) - skb->head;
	skb->csum_offset = offsetof(struct tcphdr, check);
2005-04-16 15:20:36 -07:00
}
2010-04-11 02:15:53 +00:00
/* This routine computes an IPv4 TCP checksum. */
2010-04-11 02:15:55 +00:00
void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb)
2010-04-11 02:15:53 +00:00
{
2011-10-21 05:22:42 -04:00
	const struct inet_sock *inet = inet_sk(sk);
2010-04-11 02:15:53 +00:00
	__tcp_v4_send_check(skb, inet->inet_saddr, inet->inet_daddr);
}
2010-07-09 21:22:10 +00:00
EXPORT_SYMBOL(tcp_v4_send_check);
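__tcp_v4_send_check() above only seeds th->check with the pseudo-header sum and records csum_start and csum_offset, leaving the rest of the sum to be completed later. For reference, a full software computation of the IPv4 TCP checksum could look like the sketch below. It is a plain userspace illustration that assumes host-order addresses and a segment shorter than 64 KiB; the helper names (csum_add_bytes, csum_fold16, tcp4_checksum) are hypothetical and not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running sum as big-endian 16-bit words. */
static uint32_t csum_add_bytes(uint32_t sum, const uint8_t *data, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (i < len)			/* odd trailing byte, zero padded */
		sum += (uint32_t)data[i] << 8;
	return sum;
}

/* Fold the carries back in to get a 16-bit ones-complement sum. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Checksum over the IPv4 pseudo-header plus the TCP header and payload. */
static uint16_t tcp4_checksum(uint32_t saddr, uint32_t daddr,
			      const uint8_t *seg, size_t len)
{
	uint32_t sum = 0;

	sum += saddr >> 16;
	sum += saddr & 0xffff;
	sum += daddr >> 16;
	sum += daddr & 0xffff;
	sum += 6;			/* IPPROTO_TCP */
	sum += (uint32_t)len;		/* TCP length, assumed < 65536 */
	sum = csum_add_bytes(sum, seg, len);
	return (uint16_t)~csum_fold16(sum);
}
```

A receiver can verify a segment by running the same function over it with the checksum field filled in; a result of 0 means the sum is consistent.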
2010-04-11 02:15:53 +00:00
2005-04-16 15:20:36 -07:00
/*
 * This routine will send an RST to the other tcp.
 *
 * Someone asks: why I NEVER use socket parameters (TOS, TTL etc.)
 * for reset.
 * Answer: if a packet caused RST, it is not for a socket
 * existing in our system, if it is matched to a socket,
 * it is just duplicate segment or bug in other side's TCP.
 * So that we build reply only basing on parameters
 * arrived with segment.
 * Exception: precedence violation. We do not implement it in any case.
 */
2021-04-01 16:19:44 -07:00
#ifdef CONFIG_TCP_MD5SIG
#define OPTION_BYTES TCPOLEN_MD5SIG_ALIGNED
#else
#define OPTION_BYTES sizeof(__be32)
#endif
2015-09-29 07:42:39 -07:00
static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
2005-04-16 15:20:36 -07:00
{
2011-10-21 05:22:42 -04:00
	const struct tcphdr *th = tcp_hdr(skb);
2006-11-14 19:07:45 -08:00
	struct {
		struct tcphdr th;
2021-04-01 16:19:44 -07:00
		__be32 opt[OPTION_BYTES / sizeof(__be32)];
2006-11-14 19:07:45 -08:00
	} rep;
2005-04-16 15:20:36 -07:00
	struct ip_reply_arg arg;
2006-11-14 19:07:45 -08:00
#ifdef CONFIG_TCP_MD5SIG
2015-12-21 21:29:25 +01:00
	struct tcp_md5sig_key *key = NULL;
2012-01-31 22:35:48 +00:00
	const __u8 *hash_location = NULL;
	unsigned char newhash[16];
	int genhash;
	struct sock *sk1 = NULL;
2006-11-14 19:07:45 -08:00
#endif
2019-06-13 21:22:35 -07:00
	u64 transmit_time = 0;
2018-05-10 16:53:51 +10:00
	struct sock *ctl_sk;
2019-06-13 21:22:35 -07:00
	struct net *net;
2023-05-23 18:14:52 +02:00
	u32 txhash = 0;
2005-04-16 15:20:36 -07:00
	/* Never send a reset in response to a reset. */
	if (th->rst)
		return;
2014-11-25 07:40:04 -08:00
	/* If sk not NULL, it means we did a successful lookup and incoming
	 * route had to be correct. prequeue might have dropped our dst.
	 */
	if (!sk && skb_rtable(skb)->rt_type != RTN_LOCAL)
2005-04-16 15:20:36 -07:00
		return;
	/* Swap the send and the receive. */
2006-11-14 19:07:45 -08:00
	memset(&rep, 0, sizeof(rep));
	rep.th.dest = th->source;
	rep.th.source = th->dest;
	rep.th.doff = sizeof(struct tcphdr) / 4;
	rep.th.rst = 1;
2005-04-16 15:20:36 -07:00
	if (th->ack) {
2006-11-14 19:07:45 -08:00
		rep.th.seq = th->ack_seq;
2005-04-16 15:20:36 -07:00
	} else {
2006-11-14 19:07:45 -08:00
		rep.th.ack = 1;
		rep.th.ack_seq = htonl(ntohl(th->seq) + th->syn + th->fin +
				       skb->len - (th->doff << 2));
2005-04-16 15:20:36 -07:00
	}
2006-11-17 10:57:30 -02:00
	memset(&arg, 0, sizeof(arg));
2006-11-14 19:07:45 -08:00
	arg.iov[0].iov_base = (unsigned char *)&rep;
	arg.iov[0].iov_len = sizeof(rep.th);
2014-12-09 09:56:08 -08:00
	net = sk ? sock_net(sk) : dev_net(skb_dst(skb)->dev);
2006-11-14 19:07:45 -08:00
#ifdef CONFIG_TCP_MD5SIG
2016-04-01 08:52:17 -07:00
	rcu_read_lock();
2012-01-31 22:35:48 +00:00
	hash_location = tcp_parse_md5sig_option(th);
tcp: honour SO_BINDTODEVICE for TW_RST case too
Hannes points out that when we generate tcp reset for timewait sockets we
pretend we found no socket and pass NULL sk to tcp_vX_send_reset().
Make it cope with inet tw sockets and then provide tw sk.
This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.
Packetdrill test case:
// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0
0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1
0.210 close(3) = 0
0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46
// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46
// fails without this change -- default route is used
1.301 > R 1:1(0) win 0
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-21 21:29:26 +01:00
if ( sk & & sk_fullsock ( sk ) ) {
2019-12-30 14:14:25 -08:00
const union tcp_md5_addr * addr ;
2019-12-30 14:14:28 -08:00
int l3index ;
2019-12-30 14:14:25 -08:00
2019-12-30 14:14:28 -08:00
/* sdif set, means packet ingressed via a device
* in an L3 domain and inet_iif is set to it .
*/
l3index = tcp_v4_sdif ( skb ) ? inet_iif ( skb ) : 0 ;
2019-12-30 14:14:25 -08:00
addr = ( union tcp_md5_addr * ) & ip_hdr ( skb ) - > saddr ;
2019-12-30 14:14:28 -08:00
key = tcp_md5_do_lookup ( sk , l3index , addr , AF_INET ) ;
2015-12-21 21:29:25 +01:00
} else if ( hash_location ) {
2019-12-30 14:14:25 -08:00
const union tcp_md5_addr * addr ;
2019-12-30 14:14:27 -08:00
int sdif = tcp_v4_sdif ( skb ) ;
int dif = inet_iif ( skb ) ;
2019-12-30 14:14:28 -08:00
int l3index ;
2019-12-30 14:14:25 -08:00
2012-01-31 22:35:48 +00:00
/*
 * The active side is lost. Try to find the listening socket through the
 * source port, and then find the md5 key through the listening socket.
 * This does not weaken security:
 * the incoming packet is checked against the md5 hash computed with the
 * key that was found, and no RST is generated if the hash doesn't match.
 */
2022-09-07 18:10:20 -07:00
sk1 = __inet_lookup_listener ( net , net - > ipv4 . tcp_death_row . hashinfo ,
NULL , 0 , ip_hdr ( skb ) - > saddr ,
2013-01-22 09:50:24 +00:00
th - > source , ip_hdr ( skb ) - > daddr ,
2019-12-30 14:14:27 -08:00
ntohs ( th - > source ) , dif , sdif ) ;
2012-01-31 22:35:48 +00:00
/* don't send rst if it can't find key */
if ( ! sk1 )
2016-04-01 08:52:17 -07:00
goto out ;
2019-12-30 14:14:28 -08:00
/* sdif set, means packet ingressed via a device
* in an L3 domain and dif is set to it .
*/
l3index = sdif ? dif : 0 ;
2019-12-30 14:14:25 -08:00
addr = ( union tcp_md5_addr * ) & ip_hdr ( skb ) - > saddr ;
2019-12-30 14:14:28 -08:00
key = tcp_md5_do_lookup ( sk1 , l3index , addr , AF_INET ) ;
2012-01-31 22:35:48 +00:00
if ( ! key )
2016-04-01 08:52:17 -07:00
goto out ;
2012-01-31 22:35:48 +00:00
2015-03-24 15:58:55 -07:00
genhash = tcp_v4_md5_hash_skb ( newhash , key , NULL , skb ) ;
2012-01-31 22:35:48 +00:00
if ( genhash | | memcmp ( hash_location , newhash , 16 ) ! = 0 )
2016-04-01 08:52:17 -07:00
goto out ;
2012-01-31 22:35:48 +00:00
}
2006-11-14 19:07:45 -08:00
if ( key ) {
rep . opt [ 0 ] = htonl ( ( TCPOPT_NOP < < 24 ) |
( TCPOPT_NOP < < 16 ) |
( TCPOPT_MD5SIG < < 8 ) |
TCPOLEN_MD5SIG ) ;
/* Update length and the length the header thinks exists */
arg . iov [ 0 ] . iov_len + = TCPOLEN_MD5SIG_ALIGNED ;
rep . th . doff = arg . iov [ 0 ] . iov_len / 4 ;
2008-07-19 00:01:42 -07:00
tcp_v4_md5_hash_hdr ( ( __u8 * ) & rep . opt [ 1 ] ,
2008-10-09 14:37:47 -07:00
key , ip_hdr ( skb ) - > saddr ,
ip_hdr ( skb ) - > daddr , & rep . th ) ;
2006-11-14 19:07:45 -08:00
}
# endif
2021-04-01 16:19:44 -07:00
/* Can't co-exist with TCPMD5, hence check rep.opt[0] */
if ( rep . opt [ 0 ] = = 0 ) {
__be32 mrst = mptcp_reset_option ( skb ) ;
if ( mrst ) {
rep . opt [ 0 ] = mrst ;
arg . iov [ 0 ] . iov_len + = sizeof ( mrst ) ;
rep . th . doff = arg . iov [ 0 ] . iov_len / 4 ;
}
}
2007-04-20 22:47:35 -07:00
arg . csum = csum_tcpudp_nofold ( ip_hdr ( skb ) - > daddr ,
ip_hdr ( skb ) - > saddr , /* XXX */
2008-10-08 11:34:06 -07:00
arg . iov [ 0 ] . iov_len , IPPROTO_TCP , 0 ) ;
2005-04-16 15:20:36 -07:00
arg . csumoffset = offsetof ( struct tcphdr , check ) / 2 ;
2015-12-21 21:29:26 +01:00
arg . flags = ( sk & & inet_sk_transparent ( sk ) ) ? IP_REPLY_ARG_NOSRCCHECK : 0 ;
2012-02-04 12:38:09 +00:00
/* When socket is gone, all binding information is lost.
2012-10-12 04:34:17 +00:00
* routing might fail in this case . No choice here , if we choose to force
* input interface , we will misroute in case of asymmetric route .
2012-02-04 12:38:09 +00:00
*/
2017-10-23 09:20:24 -07:00
if ( sk ) {
2012-10-12 04:34:17 +00:00
arg . bound_dev_if = sk - > sk_bound_dev_if ;
2018-02-06 20:50:23 -08:00
if ( sk_fullsock ( sk ) )
trace_tcp_send_reset ( sk , skb ) ;
2017-10-23 09:20:24 -07:00
}
2005-04-16 15:20:36 -07:00
2015-12-21 21:29:26 +01:00
BUILD_BUG_ON ( offsetof ( struct sock , sk_bound_dev_if ) ! =
offsetof ( struct inet_timewait_sock , tw_bound_dev_if ) ) ;
2011-10-24 03:06:21 -04:00
arg . tos = ip_hdr ( skb ) - > tos ;
2016-11-04 02:23:43 +09:00
arg . uid = sock_net_uid ( net , sk & & sk_fullsock ( sk ) ? sk : NULL ) ;
2016-05-06 09:46:18 -07:00
local_bh_disable ( ) ;
2022-01-24 12:24:57 -08:00
ctl_sk = this_cpu_read ( ipv4_tcp_sk ) ;
sock_net_set ( ctl_sk , net ) ;
tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.
Linux offers netem module, but it has impractical constraints :
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.
EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.
Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.
This requires FQ packet scheduler or an EDT-enabled NIC.
This patch adds TCP_TX_DELAY socket option, to set a delay in
usec units.
unsigned int tx_delay = 10000; /* 10 msec */
setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
Note that FQ packet scheduler limits might need some tweaking :
man tc-fq
PARAMETERS
limit
Hard limit on the real queue size. When this limit is
reached, new packets are dropped. If the value is lowered,
packets are dropped so that the new limit is met. Default
is 10000 packets.
flow_limit
Hard limit on the maximum number of packets queued per
flow. Default value is 100.
Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
so packets would be dropped if any of the previous limit is hit.
Use of a jump label makes this support runtime-free, for hosts
never using the option.
Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account that skbs artificially delayed
won't stop us providing more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ can not use this trick)
Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 11:57:25 -07:00
if ( sk ) {
2018-05-10 16:53:51 +10:00
ctl_sk - > sk_mark = ( sk - > sk_state = = TCP_TIME_WAIT ) ?
inet_twsk ( sk ) - > tw_mark : sk - > sk_mark ;
2019-09-24 08:01:16 -07:00
ctl_sk - > sk_priority = ( sk - > sk_state = = TCP_TIME_WAIT ) ?
inet_twsk ( sk ) - > tw_priority : sk - > sk_priority ;
2019-06-13 21:22:35 -07:00
transmit_time = tcp_transmit_time ( sk ) ;
2022-07-07 10:01:39 +00:00
xfrm_sk_clone_policy ( ctl_sk , sk ) ;
2023-05-23 18:14:52 +02:00
txhash = ( sk - > sk_state = = TCP_TIME_WAIT ) ?
inet_twsk ( sk ) - > tw_txhash : sk - > sk_txhash ;
2023-05-11 11:47:49 +00:00
} else {
ctl_sk - > sk_mark = 0 ;
ctl_sk - > sk_priority = 0 ;
2019-06-12 11:57:25 -07:00
}
2018-05-10 16:53:51 +10:00
ip_send_unicast_reply ( ctl_sk ,
2015-01-29 21:35:05 -08:00
skb , & TCP_SKB_CB ( skb ) - > header . h4 . opt ,
2014-09-27 09:50:55 -07:00
ip_hdr ( skb ) - > saddr , ip_hdr ( skb ) - > daddr ,
2019-06-13 21:22:35 -07:00
& arg , arg . iov [ 0 ] . iov_len ,
2023-05-23 18:14:52 +02:00
transmit_time , txhash ) ;
2005-04-16 15:20:36 -07:00
2022-07-07 10:01:39 +00:00
xfrm_sk_free_policy ( ctl_sk ) ;
2022-01-24 12:24:57 -08:00
sock_net_set ( ctl_sk , & init_net ) ;
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_OUTSEGS ) ;
__TCP_INC_STATS ( net , TCP_MIB_OUTRSTS ) ;
2016-05-06 09:46:18 -07:00
local_bh_enable ( ) ;
2012-01-31 22:35:48 +00:00
# ifdef CONFIG_TCP_MD5SIG
2016-04-01 08:52:17 -07:00
out :
rcu_read_unlock ( ) ;
2012-01-31 22:35:48 +00:00
# endif
2005-04-16 15:20:36 -07:00
}
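The sequence-number choice made by tcp_v4_send_reset() above can be illustrated outside the kernel. The following is a minimal user-space sketch, not kernel code: `struct seg_in`, `struct rst_out` and `rst_numbers()` are hypothetical names, all values are in host byte order, and `payload_len` stands in for `skb->len - (th->doff << 2)`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-byte-order view of the fields the reset path reads. */
struct seg_in {
	uint32_t seq;
	uint32_t ack_seq;
	bool syn, fin, ack;
	uint32_t payload_len;	/* assumed: segment length minus TCP header */
};

/* Hypothetical view of the fields written into the RST reply. */
struct rst_out {
	uint32_t seq;
	uint32_t ack_seq;
	bool ack;
};

/*
 * If the offending segment carried an ACK, the RST takes its sequence
 * number from that ACK field. Otherwise the RST has seq 0 and acknowledges
 * everything the segment occupied in sequence space (SYN and FIN count
 * as one each). Arithmetic wraps modulo 2^32, like TCP sequence numbers.
 */
static struct rst_out rst_numbers(const struct seg_in *in)
{
	struct rst_out out = { 0, 0, false };

	if (in->ack) {
		out.seq = in->ack_seq;
	} else {
		out.ack = true;
		out.ack_seq = in->seq + in->syn + in->fin + in->payload_len;
	}
	return out;
}
```

With the packetdrill trace quoted earlier in this function (incoming `P. 1:1001(1000) ack 1`), this sketch yields a reset with seq 1 and no ACK flag, which agrees with the expected `R 1:1(0)`.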
/* The code below, which sends ACKs in SYN-RECV and TIME-WAIT states
   outside socket context, is certainly ugly. What can I do?
 */
2016-11-04 02:23:43 +09:00
static void tcp_v4_send_ack ( const struct sock * sk ,
2016-01-21 08:02:54 -08:00
struct sk_buff * skb , u32 seq , u32 ack ,
2013-02-11 05:50:19 +00:00
u32 win , u32 tsval , u32 tsecr , int oif ,
2008-10-01 07:41:00 -07:00
struct tcp_md5sig_key * key ,
2023-05-23 18:14:52 +02:00
int reply_flags , u8 tos , u32 txhash )
2005-04-16 15:20:36 -07:00
{
2011-10-21 05:22:42 -04:00
const struct tcphdr * th = tcp_hdr ( skb ) ;
2005-04-16 15:20:36 -07:00
struct {
struct tcphdr th ;
2006-11-14 20:51:49 -08:00
__be32 opt [ ( TCPOLEN_TSTAMP_ALIGNED > > 2 )
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2006-11-14 20:51:49 -08:00
+ ( TCPOLEN_MD5SIG_ALIGNED > > 2 )
2006-11-14 19:07:45 -08:00
# endif
] ;
2005-04-16 15:20:36 -07:00
} rep ;
2016-11-04 02:23:43 +09:00
struct net * net = sock_net ( sk ) ;
2005-04-16 15:20:36 -07:00
struct ip_reply_arg arg ;
2018-05-10 16:53:51 +10:00
struct sock * ctl_sk ;
2019-06-13 21:22:35 -07:00
u64 transmit_time ;
2005-04-16 15:20:36 -07:00
memset ( & rep . th , 0 , sizeof ( struct tcphdr ) ) ;
2006-11-17 10:57:30 -02:00
memset ( & arg , 0 , sizeof ( arg ) ) ;
2005-04-16 15:20:36 -07:00
arg . iov [ 0 ] . iov_base = ( unsigned char * ) & rep ;
arg . iov [ 0 ] . iov_len = sizeof ( rep . th ) ;
2013-02-11 05:50:19 +00:00
if ( tsecr ) {
2006-11-14 19:07:45 -08:00
rep . opt [ 0 ] = htonl ( ( TCPOPT_NOP < < 24 ) | ( TCPOPT_NOP < < 16 ) |
( TCPOPT_TIMESTAMP < < 8 ) |
TCPOLEN_TIMESTAMP ) ;
2013-02-11 05:50:19 +00:00
rep . opt [ 1 ] = htonl ( tsval ) ;
rep . opt [ 2 ] = htonl ( tsecr ) ;
2007-01-09 00:11:15 -08:00
arg . iov [ 0 ] . iov_len + = TCPOLEN_TSTAMP_ALIGNED ;
2005-04-16 15:20:36 -07:00
}
/* Swap the send and the receive. */
rep . th . dest = th - > source ;
rep . th . source = th - > dest ;
rep . th . doff = arg . iov [ 0 ] . iov_len / 4 ;
rep . th . seq = htonl ( seq ) ;
rep . th . ack_seq = htonl ( ack ) ;
rep . th . ack = 1 ;
rep . th . window = htons ( win ) ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
if ( key ) {
2013-02-11 05:50:19 +00:00
int offset = ( tsecr ) ? 3 : 0 ;
2006-11-14 19:07:45 -08:00
rep . opt [ offset + + ] = htonl ( ( TCPOPT_NOP < < 24 ) |
( TCPOPT_NOP < < 16 ) |
( TCPOPT_MD5SIG < < 8 ) |
TCPOLEN_MD5SIG ) ;
arg . iov [ 0 ] . iov_len + = TCPOLEN_MD5SIG_ALIGNED ;
rep . th . doff = arg . iov [ 0 ] . iov_len / 4 ;
2008-07-19 00:01:42 -07:00
tcp_v4_md5_hash_hdr ( ( __u8 * ) & rep . opt [ offset ] ,
2008-07-31 20:49:48 -07:00
key , ip_hdr ( skb ) - > saddr ,
ip_hdr ( skb ) - > daddr , & rep . th ) ;
2006-11-14 19:07:45 -08:00
}
# endif
2008-10-01 07:41:00 -07:00
arg . flags = reply_flags ;
2007-04-20 22:47:35 -07:00
arg . csum = csum_tcpudp_nofold ( ip_hdr ( skb ) - > daddr ,
ip_hdr ( skb ) - > saddr , /* XXX */
2005-04-16 15:20:36 -07:00
arg . iov [ 0 ] . iov_len , IPPROTO_TCP , 0 ) ;
arg . csumoffset = offsetof ( struct tcphdr , check ) / 2 ;
2008-04-18 12:45:16 +09:00
if ( oif )
arg . bound_dev_if = oif ;
2011-10-24 03:06:21 -04:00
arg . tos = tos ;
2016-11-04 02:23:43 +09:00
arg . uid = sock_net_uid ( net , sk_fullsock ( sk ) ? sk : NULL ) ;
2016-05-06 09:46:18 -07:00
local_bh_disable ( ) ;
2022-01-24 12:24:57 -08:00
ctl_sk = this_cpu_read ( ipv4_tcp_sk ) ;
sock_net_set ( ctl_sk , net ) ;
2019-06-12 11:57:25 -07:00
ctl_sk - > sk_mark = ( sk - > sk_state = = TCP_TIME_WAIT ) ?
inet_twsk ( sk ) - > tw_mark : sk - > sk_mark ;
2019-09-24 08:01:16 -07:00
ctl_sk - > sk_priority = ( sk - > sk_state = = TCP_TIME_WAIT ) ?
inet_twsk ( sk ) - > tw_priority : sk - > sk_priority ;
2019-06-13 21:22:35 -07:00
transmit_time = tcp_transmit_time ( sk ) ;
2018-05-10 16:53:51 +10:00
ip_send_unicast_reply ( ctl_sk ,
2015-01-29 21:35:05 -08:00
skb , & TCP_SKB_CB ( skb ) - > header . h4 . opt ,
2014-09-27 09:50:55 -07:00
ip_hdr ( skb ) - > saddr , ip_hdr ( skb ) - > daddr ,
2019-06-13 21:22:35 -07:00
& arg , arg . iov [ 0 ] . iov_len ,
2023-05-23 18:14:52 +02:00
transmit_time , txhash ) ;
2005-04-16 15:20:36 -07:00
2022-01-24 12:24:57 -08:00
sock_net_set ( ctl_sk , & init_net ) ;
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_OUTSEGS ) ;
2016-05-06 09:46:18 -07:00
local_bh_enable ( ) ;
2005-04-16 15:20:36 -07:00
}
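The option words that tcp_v4_send_ack() writes into rep.opt[] follow a fixed layout: two NOP bytes for alignment, then the option kind and its length. The sketch below is a stand-alone illustration in host byte order (the kernel wraps the word in htonl()). The enum names and the helpers `opt_word()` and `ack_doff()` are hypothetical; the numeric values are the standard TCP option kinds and lengths.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
	OPT_NOP = 1,			/* TCPOPT_NOP */
	OPT_TIMESTAMP = 8,		/* TCPOPT_TIMESTAMP */
	OPT_MD5SIG = 19,		/* TCPOPT_MD5SIG */
	LEN_TIMESTAMP = 10,		/* TCPOLEN_TIMESTAMP */
	LEN_MD5SIG = 18,		/* TCPOLEN_MD5SIG */
	LEN_TSTAMP_ALIGNED = 12,	/* TCPOLEN_TSTAMP_ALIGNED */
	LEN_MD5SIG_ALIGNED = 20,	/* TCPOLEN_MD5SIG_ALIGNED */
	BASE_HDR_LEN = 20		/* sizeof(struct tcphdr) */
};

/* Build the leading 32-bit option word: NOP, NOP, kind, length. */
static uint32_t opt_word(uint8_t kind, uint8_t len)
{
	return ((uint32_t)OPT_NOP << 24) | ((uint32_t)OPT_NOP << 16) |
	       ((uint32_t)kind << 8) | len;
}

/* Data offset (header length in 32-bit words) for the chosen options. */
static unsigned int ack_doff(bool has_ts, bool has_md5)
{
	unsigned int len = BASE_HDR_LEN;

	if (has_ts)
		len += LEN_TSTAMP_ALIGNED;
	if (has_md5)
		len += LEN_MD5SIG_ALIGNED;
	return len / 4;
}
```

Padding each option with two NOPs keeps the following payload (timestamps or the 16-byte digest) aligned on a 32-bit boundary, which is why the aligned lengths are 12 and 20 rather than 10 and 18.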
static void tcp_v4_timewait_ack ( struct sock * sk , struct sk_buff * skb )
{
2005-08-09 20:09:30 -07:00
struct inet_timewait_sock * tw = inet_twsk ( sk ) ;
2006-11-14 19:07:45 -08:00
struct tcp_timewait_sock * tcptw = tcp_twsk ( sk ) ;
2005-04-16 15:20:36 -07:00
2016-11-04 02:23:43 +09:00
tcp_v4_send_ack ( sk , skb ,
2016-01-21 08:02:54 -08:00
tcptw - > tw_snd_nxt , tcptw - > tw_rcv_nxt ,
2006-11-17 10:57:30 -02:00
tcptw - > tw_rcv_wnd > > tw - > tw_rcv_wscale ,
2017-05-16 14:00:14 -07:00
tcp_time_stamp_raw ( ) + tcptw - > tw_ts_offset ,
2008-04-18 12:45:16 +09:00
tcptw - > tw_ts_recent ,
tw - > tw_bound_dev_if ,
2008-10-01 07:41:00 -07:00
tcp_twsk_md5_key ( tcptw ) ,
2011-10-24 03:06:21 -04:00
tw - > tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0 ,
2023-05-23 18:14:52 +02:00
tw - > tw_tos ,
tw - > tw_txhash
2008-04-18 12:45:16 +09:00
) ;
2005-04-16 15:20:36 -07:00
2005-08-09 20:09:30 -07:00
inet_twsk_put ( tw ) ;
2005-04-16 15:20:36 -07:00
}
2015-09-29 07:42:39 -07:00
static void tcp_v4_reqsk_send_ack ( const struct sock * sk , struct sk_buff * skb ,
2006-11-17 10:57:30 -02:00
struct request_sock * req )
2005-04-16 15:20:36 -07:00
{
2019-12-30 14:14:25 -08:00
const union tcp_md5_addr * addr ;
2019-12-30 14:14:28 -08:00
int l3index ;
2019-12-30 14:14:25 -08:00
2012-08-31 12:29:13 +00:00
/* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
* sk - > sk_state = = TCP_SYN_RECV - > for Fast Open .
*/
2016-01-21 08:02:54 -08:00
u32 seq = ( sk - > sk_state = = TCP_LISTEN ) ? tcp_rsk ( req ) - > snt_isn + 1 :
tcp_sk ( sk ) - > snd_nxt ;
tcp: properly scale window in tcp_v[46]_reqsk_send_ack()
When sending an ack in SYN_RECV state, we must scale the offered
window if wscale option was negotiated and accepted.
Tested:
Following packetdrill test demonstrates the issue :
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
// Establish a connection.
+0 < S 0:0(0) win 20000 <mss 1000,sackOK,wscale 7, nop, TS val 100 ecr 0>
+0 > S. 0:0(0) ack 1 win 28960 <mss 1460,sackOK, TS val 100 ecr 100, nop, wscale 7>
+0 < . 1:11(10) ack 1 win 156 <nop,nop,TS val 99 ecr 100>
// check that window is properly scaled !
+0 > . 1:1(0) ack 1 win 226 <nop,nop,TS val 200 ecr 100>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22 11:31:10 -07:00
/* RFC 7323 2.3
* The window field ( SEG . WND ) of every outgoing segment , with the
* exception of < SYN > segments , MUST be right - shifted by
* Rcv . Wind . Shift bits :
*/
2019-12-30 14:14:25 -08:00
addr = ( union tcp_md5_addr * ) & ip_hdr ( skb ) - > saddr ;
2019-12-30 14:14:28 -08:00
l3index = tcp_v4_sdif ( skb ) ? inet_iif ( skb ) : 0 ;
2016-11-04 02:23:43 +09:00
tcp_v4_send_ack ( sk , skb , seq ,
2016-08-22 11:31:10 -07:00
tcp_rsk ( req ) - > rcv_nxt ,
req - > rsk_rcv_wnd > > inet_rsk ( req ) - > rcv_wscale ,
2017-05-16 14:00:14 -07:00
tcp_time_stamp_raw ( ) + tcp_rsk ( req ) - > ts_off ,
2008-04-18 12:45:16 +09:00
req - > ts_recent ,
0 ,
2019-12-30 14:14:28 -08:00
tcp_md5_do_lookup ( sk , l3index , addr , AF_INET ) ,
2011-10-24 03:06:21 -04:00
inet_rsk ( req ) - > no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0 ,
2023-05-23 18:14:52 +02:00
ip_hdr ( skb ) - > tos , tcp_rsk ( req ) - > txhash ) ;
2005-04-16 15:20:36 -07:00
}
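The window argument passed by tcp_v4_reqsk_send_ack() is the receive window right-shifted by the negotiated scale, as the RFC 7323 comment requires. A minimal sketch of that arithmetic, with `scaled_window()` as a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Value to place in the 16-bit window field of a non-SYN segment:
 * the real receive window shifted right by the window scale that
 * was negotiated during the handshake.
 */
static uint32_t scaled_window(uint32_t rcv_wnd, uint8_t rcv_wscale)
{
	return rcv_wnd >> rcv_wscale;
}
```

For the packetdrill trace quoted in this function, a 28960-byte window with wscale 7 gives 226, the value expected in the final ACK.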
/*
2008-02-17 22:29:19 -08:00
* Send a SYN - ACK after having received a SYN .
2005-06-18 22:47:21 -07:00
* This still operates on a request_sock only , not on a big
2005-04-16 15:20:36 -07:00
* socket .
*/
2015-09-25 07:39:21 -07:00
static int tcp_v4_send_synack ( const struct sock * sk , struct dst_entry * dst ,
2014-06-25 17:09:58 +03:00
struct flowi * fl ,
2010-01-17 19:09:39 -08:00
struct request_sock * req ,
2015-10-02 11:43:35 -07:00
struct tcp_fastopen_cookie * foc ,
2020-08-20 12:00:52 -07:00
enum tcp_synack_type synack_type ,
struct sk_buff * syn_skb )
2005-04-16 15:20:36 -07:00
{
[NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:46:52 -07:00
const struct inet_request_sock * ireq = inet_rsk ( req ) ;
2011-05-18 18:32:03 -04:00
struct flowi4 fl4 ;
2005-04-16 15:20:36 -07:00
int err = - 1 ;
2013-12-23 14:37:28 +08:00
struct sk_buff * skb ;
2020-09-09 17:50:48 -07:00
u8 tos ;
2005-04-16 15:20:36 -07:00
/* First, grab a route. */
2012-07-17 14:02:46 -07:00
if ( ! dst & & ( dst = inet_csk_route_req ( sk , & fl4 , req ) ) = = NULL )
2008-02-29 11:43:03 -08:00
return - 1 ;
2005-04-16 15:20:36 -07:00
2020-08-20 12:00:52 -07:00
skb = tcp_make_synack ( sk , dst , req , foc , synack_type , syn_skb ) ;
2005-04-16 15:20:36 -07:00
if ( skb ) {
2013-10-09 15:21:29 -07:00
__tcp_v4_send_check ( skb , ireq - > ir_loc_addr , ireq - > ir_rmt_addr ) ;
2005-04-16 15:20:36 -07:00
2022-07-22 11:22:04 -07:00
tos = READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_tcp_reflect_tos ) ?
2020-12-08 09:55:08 -08:00
( tcp_rsk ( req ) - > syn_tos & ~ INET_ECN_MASK ) |
( inet_sk ( sk ) - > tos & INET_ECN_MASK ) :
2020-11-20 19:47:44 -08:00
inet_sk ( sk ) - > tos ;
if ( ! INET_ECN_is_capable ( tos ) & &
tcp_bpf_ca_needs_ecn ( ( struct sock * ) req ) )
tos | = INET_ECN_ECT_0 ;
2018-10-02 12:35:05 -07:00
rcu_read_lock ( ) ;
2013-10-09 15:21:29 -07:00
err = ip_build_and_send_pkt ( skb , sk , ireq - > ir_loc_addr ,
ireq - > ir_rmt_addr ,
2020-09-09 17:50:47 -07:00
rcu_dereference ( ireq - > ireq_opt ) ,
2020-11-19 13:23:51 -08:00
tos ) ;
2018-10-02 12:35:05 -07:00
rcu_read_unlock ( ) ;
2006-11-14 11:21:36 -02:00
err = net_xmit_eval ( err ) ;
2005-04-16 15:20:36 -07:00
}
return err ;
}
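The TOS selection in tcp_v4_send_synack() can be followed more easily in isolation. This is a user-space sketch under stated assumptions: `synack_tos()` is a hypothetical helper, `reflect` stands in for the tcp_reflect_tos sysctl, and `ca_needs_ecn` stands in for the result of tcp_bpf_ca_needs_ecn(). The ECN constants use the standard values (mask 0x3, ECT(0) 0x2).

```c
#include <stdbool.h>
#include <stdint.h>

enum {
	ECN_MASK = 3,	/* INET_ECN_MASK: low two bits of the TOS byte */
	ECN_ECT_0 = 2	/* INET_ECN_ECT_0 */
};

/*
 * With reflection enabled, the DSCP bits come from the received SYN and
 * the ECN bits from the listening socket. Otherwise the socket's TOS is
 * used unchanged. ECT(0) is then forced on if the congestion control
 * needs ECN and the TOS is not already ECN-capable.
 */
static uint8_t synack_tos(bool reflect, uint8_t syn_tos, uint8_t sk_tos,
			  bool ca_needs_ecn)
{
	uint8_t tos = reflect ?
		(uint8_t)((syn_tos & ~ECN_MASK) | (sk_tos & ECN_MASK)) :
		sk_tos;

	if (!(tos & ECN_ECT_0) && ca_needs_ecn)
		tos |= ECN_ECT_0;
	return tos;
}
```

Keeping the ECN bits under the local socket's control means a peer cannot dictate the ECN codepoint of the SYN-ACK simply by setting it in its SYN.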
/*
2005-06-18 22:47:21 -07:00
* IPv4 request_sock destructor .
2005-04-16 15:20:36 -07:00
*/
2005-06-18 22:47:21 -07:00
static void tcp_v4_reqsk_destructor ( struct request_sock * req )
2005-04-16 15:20:36 -07:00
{
2017-10-20 09:04:13 -07:00
kfree ( rcu_dereference_protected ( inet_rsk ( req ) - > ireq_opt , 1 ) ) ;
2005-04-16 15:20:36 -07:00
}
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
/*
* RFC2385 MD5 checksumming requires a mapping of
* IP address - > MD5 Key .
* We need to maintain these in the sk structure .
*/
2022-11-23 17:38:57 +00:00
DEFINE_STATIC_KEY_DEFERRED_FALSE ( tcp_md5_needed , HZ ) ;
2018-11-27 15:03:21 -08:00
EXPORT_SYMBOL ( tcp_md5_needed ) ;
2021-10-15 10:26:04 +03:00
static bool better_md5_match ( struct tcp_md5sig_key * old , struct tcp_md5sig_key * new )
{
if ( ! old )
return true ;
/* l3index always overrides non-l3index */
if ( old - > l3index & & new - > l3index = = 0 )
return false ;
if ( old - > l3index = = 0 & & new - > l3index )
return true ;
return old - > prefixlen < new - > prefixlen ;
}
2006-11-14 19:07:45 -08:00
/* Find the Key structure for an address. */
2019-12-30 14:14:28 -08:00
struct tcp_md5sig_key * __tcp_md5_do_lookup ( const struct sock * sk , int l3index ,
2018-11-27 15:03:21 -08:00
const union tcp_md5_addr * addr ,
int family )
2006-11-14 19:07:45 -08:00
{
2015-03-24 15:58:56 -07:00
const struct tcp_sock * tp = tcp_sk ( sk ) ;
2012-01-31 05:18:33 +00:00
struct tcp_md5sig_key * key ;
2015-03-24 15:58:56 -07:00
const struct tcp_md5sig_info * md5sig ;
2017-06-15 18:07:06 -07:00
__be32 mask ;
struct tcp_md5sig_key * best_match = NULL ;
bool match ;
2006-11-14 19:07:45 -08:00
2012-01-31 18:45:40 +00:00
/* caller either holds rcu_read_lock() or socket lock */
md5sig = rcu_dereference_check ( tp - > md5sig_info ,
2016-04-05 17:10:15 +02:00
lockdep_sock_is_held ( sk ) ) ;
2012-01-31 18:45:40 +00:00
if ( ! md5sig )
2006-11-14 19:07:45 -08:00
return NULL ;
2017-06-20 22:11:21 +02:00
2020-02-21 23:27:14 +05:30
hlist_for_each_entry_rcu ( key , & md5sig - > head , node ,
lockdep_sock_is_held ( sk ) ) {
2012-01-31 05:18:33 +00:00
if ( key - > family ! = family )
continue ;
2021-10-15 10:26:05 +03:00
if ( key - > flags & TCP_MD5SIG_FLAG_IFINDEX & & key - > l3index ! = l3index )
2019-12-30 14:14:28 -08:00
continue ;
2017-06-15 18:07:06 -07:00
if ( family = = AF_INET ) {
mask = inet_make_mask ( key - > prefixlen ) ;
match = ( key - > addr . a4 . s_addr & mask ) = =
( addr - > a4 . s_addr & mask ) ;
# if IS_ENABLED(CONFIG_IPV6)
} else if ( family = = AF_INET6 ) {
match = ipv6_prefix_equal ( & key - > addr . a6 , & addr - > a6 ,
key - > prefixlen ) ;
# endif
} else {
match = false ;
}
2021-10-15 10:26:04 +03:00
if ( match & & better_md5_match ( best_match , key ) )
2017-06-15 18:07:06 -07:00
best_match = key ;
}
return best_match ;
}
2018-11-27 15:03:21 -08:00
EXPORT_SYMBOL ( __tcp_md5_do_lookup ) ;
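The selection rule implemented by better_md5_match() and __tcp_md5_do_lookup() amounts to: among the keys whose prefix covers the peer address, prefer one bound to an L3 domain, and otherwise prefer the longest prefix. The sketch below restates that for IPv4 only, over a plain array instead of an RCU hlist. `struct md5_key`, `make_mask()`, `better_match()` and `lookup()` are hypothetical names, addresses are in host byte order, `prefixlen` is assumed to be in the range 0..32, and `match_ifindex` stands in for TCP_MD5SIG_FLAG_IFINDEX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct md5_key {
	uint32_t addr;		/* host byte order in this sketch */
	uint8_t prefixlen;	/* assumed to be 0..32 */
	int l3index;
	bool match_ifindex;	/* stands in for TCP_MD5SIG_FLAG_IFINDEX */
};

/* Netmask for a prefix length; a zero-length prefix matches everything. */
static uint32_t make_mask(uint8_t prefixlen)
{
	return prefixlen ? (uint32_t)(~(uint32_t)0 << (32 - prefixlen)) : 0;
}

/* Same ordering as better_md5_match(): L3 domain first, then prefix length. */
static bool better_match(const struct md5_key *old, const struct md5_key *cand)
{
	if (!old)
		return true;
	if (old->l3index && cand->l3index == 0)
		return false;
	if (old->l3index == 0 && cand->l3index)
		return true;
	return old->prefixlen < cand->prefixlen;
}

/* Return the best key covering addr, or NULL if none matches. */
static const struct md5_key *lookup(const struct md5_key *keys, size_t n,
				    uint32_t addr, int l3index)
{
	const struct md5_key *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct md5_key *k = &keys[i];
		uint32_t mask = make_mask(k->prefixlen);

		if (k->match_ifindex && k->l3index != l3index)
			continue;
		if ((k->addr & mask) != (addr & mask))
			continue;
		if (better_match(best, k))
			best = k;
	}
	return best;
}
```

Because the comparison is applied to every matching key, the result does not depend on the order in which keys were added.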
2017-06-15 18:07:06 -07:00
2017-07-06 07:58:53 +08:00
static struct tcp_md5sig_key * tcp_md5_do_lookup_exact ( const struct sock * sk ,
const union tcp_md5_addr * addr ,
2019-12-30 14:14:28 -08:00
int family , u8 prefixlen ,
2021-10-15 10:26:05 +03:00
int l3index , u8 flags )
2017-06-15 18:07:06 -07:00
{
const struct tcp_sock * tp = tcp_sk ( sk ) ;
struct tcp_md5sig_key * key ;
unsigned int size = sizeof ( struct in_addr ) ;
const struct tcp_md5sig_info * md5sig ;
/* caller either holds rcu_read_lock() or socket lock */
md5sig = rcu_dereference_check ( tp - > md5sig_info ,
lockdep_sock_is_held ( sk ) ) ;
if ( ! md5sig )
return NULL ;
# if IS_ENABLED(CONFIG_IPV6)
if ( family = = AF_INET6 )
size = sizeof ( struct in6_addr ) ;
# endif
2020-02-21 23:27:14 +05:30
hlist_for_each_entry_rcu ( key , & md5sig - > head , node ,
lockdep_sock_is_held ( sk ) ) {
2017-06-15 18:07:06 -07:00
if ( key - > family ! = family )
continue ;
2021-10-15 10:26:05 +03:00
if ( ( key - > flags & TCP_MD5SIG_FLAG_IFINDEX ) ! = ( flags & TCP_MD5SIG_FLAG_IFINDEX ) )
continue ;
2021-10-15 10:26:04 +03:00
if ( key - > l3index ! = l3index )
2019-12-30 14:14:28 -08:00
continue ;
2017-06-15 18:07:06 -07:00
if ( ! memcmp ( & key - > addr , addr , size ) & &
key - > prefixlen = = prefixlen )
2012-01-31 05:18:33 +00:00
return key ;
2006-11-14 19:07:45 -08:00
}
return NULL ;
}
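/*
 * Illustration only, not kernel code: a userspace sketch of the exact
 * lookup above, which requires identical address bytes and an identical
 * prefix length rather than a masked match. demo_exact_key and
 * demo_lookup_exact are hypothetical names, and the flags comparison is
 * omitted here for brevity.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical stand-in for a key entry; addr holds 4 or 16 raw bytes. */
struct demo_exact_key {
	int family;
	uint8_t prefixlen;
	int l3index;
	uint8_t addr[16];
};

static const struct demo_exact_key *
demo_lookup_exact(const struct demo_exact_key *keys, size_t n,
		  const uint8_t *addr, int family, uint8_t prefixlen,
		  int l3index)
{
	/* Compare only as many bytes as the address family carries. */
	size_t size = (family == AF_INET6) ? 16 : 4;

	assert(addr != NULL);
	for (size_t i = 0; i < n; i++) {
		if (keys[i].family != family)
			continue;
		if (keys[i].l3index != l3index)
			continue;
		if (!memcmp(keys[i].addr, addr, size) &&
		    keys[i].prefixlen == prefixlen)
			return &keys[i];
	}
	return NULL;
}
```

/*
 * Two entries with the same address but different prefix lengths are
 * distinct keys in this sketch, which is why add and delete go through
 * an exact lookup instead of the best-match one.
 */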
2015-09-25 07:39:15 -07:00
struct tcp_md5sig_key * tcp_v4_md5_lookup ( const struct sock * sk ,
2015-03-24 15:58:56 -07:00
const struct sock * addr_sk )
2006-11-14 19:07:45 -08:00
{
2015-04-09 14:36:42 -07:00
const union tcp_md5_addr * addr ;
2019-12-30 14:14:28 -08:00
int l3index ;
2012-01-31 05:18:33 +00:00
2019-12-30 14:14:28 -08:00
l3index = l3mdev_master_ifindex_by_index ( sock_net ( sk ) ,
addr_sk - > sk_bound_dev_if ) ;
2015-04-09 14:36:42 -07:00
addr = ( const union tcp_md5_addr * ) & addr_sk - > sk_daddr ;
2019-12-30 14:14:28 -08:00
return tcp_md5_do_lookup ( sk , l3index , addr , AF_INET ) ;
2006-11-14 19:07:45 -08:00
}
EXPORT_SYMBOL ( tcp_v4_md5_lookup ) ;
2022-11-23 17:38:56 +00:00
static int tcp_md5sig_info_add ( struct sock * sk , gfp_t gfp )
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
struct tcp_md5sig_info * md5sig ;
md5sig = kmalloc ( sizeof ( * md5sig ) , gfp ) ;
if ( ! md5sig )
return - ENOMEM ;
sk_gso_disable ( sk ) ;
INIT_HLIST_HEAD ( & md5sig - > head ) ;
rcu_assign_pointer ( tp - > md5sig_info , md5sig ) ;
return 0 ;
}
2006-11-14 19:07:45 -08:00
/* This can be called on a newly created socket, from other files */
2022-11-23 17:38:57 +00:00
static int __tcp_md5_do_add ( struct sock * sk , const union tcp_md5_addr * addr ,
int family , u8 prefixlen , int l3index , u8 flags ,
const u8 * newkey , u8 newkeylen , gfp_t gfp )
2006-11-14 19:07:45 -08:00
{
/* Add Key to the list */
2007-10-29 20:55:27 -07:00
struct tcp_md5sig_key * key ;
2006-11-14 19:07:45 -08:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
2012-01-31 05:18:33 +00:00
struct tcp_md5sig_info * md5sig ;
2006-11-14 19:07:45 -08:00
2021-10-15 10:26:05 +03:00
key = tcp_md5_do_lookup_exact ( sk , addr , family , prefixlen , l3index , flags ) ;
2006-11-14 19:07:45 -08:00
if ( key ) {
2020-07-01 11:43:04 -07:00
/* Pre-existing entry - just update that one.
* Note that the key might be used concurrently .
* data_race ( ) is telling kcsan that we do not care about
* key mismatches , since changing MD5 key on live flows
* can lead to packet drops .
*/
data_race ( memcpy ( key - > key , newkey , newkeylen ) ) ;
2020-06-30 16:41:01 -07:00
2020-07-01 11:43:04 -07:00
/* Pairs with READ_ONCE() in tcp_md5_hash_key().
* Also note that a reader could catch new key - > keylen value
* but old key - > key [ ] , this is the reason we use __GFP_ZERO
* at sock_kmalloc ( ) time below these lines .
*/
WRITE_ONCE ( key - > keylen , newkeylen ) ;
2020-06-30 16:41:01 -07:00
2012-01-31 05:18:33 +00:00
return 0 ;
}
2011-09-29 17:10:10 +00:00
2012-01-31 18:45:40 +00:00
md5sig = rcu_dereference_protected ( tp - > md5sig_info ,
2016-04-05 17:10:15 +02:00
lockdep_sock_is_held ( sk ) ) ;
2006-11-14 19:07:45 -08:00
2020-07-01 11:43:04 -07:00
key = sock_kmalloc ( sk , sizeof ( * key ) , gfp | __GFP_ZERO ) ;
2012-01-31 05:18:33 +00:00
if ( ! key )
return - ENOMEM ;
2013-05-20 06:52:26 +00:00
if ( ! tcp_alloc_md5sig_pool ( ) ) {
2012-01-31 10:56:48 +00:00
sock_kfree_s ( sk , key , sizeof ( * key ) ) ;
2012-01-31 05:18:33 +00:00
return - ENOMEM ;
2006-11-14 19:07:45 -08:00
}
2012-01-31 05:18:33 +00:00
memcpy ( key - > key , newkey , newkeylen ) ;
key - > keylen = newkeylen ;
key - > family = family ;
2017-06-15 18:07:06 -07:00
key - > prefixlen = prefixlen ;
2019-12-30 14:14:28 -08:00
key - > l3index = l3index ;
2021-10-15 10:26:05 +03:00
key - > flags = flags ;
2012-01-31 05:18:33 +00:00
memcpy ( & key - > addr , addr ,
2022-05-26 18:12:13 +08:00
( IS_ENABLED ( CONFIG_IPV6 ) & & family = = AF_INET6 ) ? sizeof ( struct in6_addr ) :
sizeof ( struct in_addr ) ) ;
2012-01-31 05:18:33 +00:00
hlist_add_head_rcu ( & key - > node , & md5sig - > head ) ;
2006-11-14 19:07:45 -08:00
return 0 ;
}
2022-11-23 17:38:57 +00:00
int tcp_md5_do_add ( struct sock * sk , const union tcp_md5_addr * addr ,
int family , u8 prefixlen , int l3index , u8 flags ,
const u8 * newkey , u8 newkeylen )
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
if ( ! rcu_dereference_protected ( tp - > md5sig_info , lockdep_sock_is_held ( sk ) ) ) {
if ( tcp_md5sig_info_add ( sk , GFP_KERNEL ) )
return - ENOMEM ;
if ( ! static_branch_inc ( & tcp_md5_needed . key ) ) {
struct tcp_md5sig_info * md5sig ;
md5sig = rcu_dereference_protected ( tp - > md5sig_info , lockdep_sock_is_held ( sk ) ) ;
rcu_assign_pointer ( tp - > md5sig_info , NULL ) ;
2022-12-02 05:28:47 +00:00
kfree_rcu ( md5sig , rcu ) ;
2022-11-23 17:38:57 +00:00
return - EUSERS ;
}
}
return __tcp_md5_do_add ( sk , addr , family , prefixlen , l3index , flags ,
newkey , newkeylen , GFP_KERNEL ) ;
}
2012-01-31 05:18:33 +00:00
EXPORT_SYMBOL ( tcp_md5_do_add ) ;
2006-11-14 19:07:45 -08:00
2022-11-23 17:38:57 +00:00
int tcp_md5_key_copy ( struct sock * sk , const union tcp_md5_addr * addr ,
int family , u8 prefixlen , int l3index ,
struct tcp_md5sig_key * key )
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
if ( ! rcu_dereference_protected ( tp - > md5sig_info , lockdep_sock_is_held ( sk ) ) ) {
if ( tcp_md5sig_info_add ( sk , sk_gfp_mask ( sk , GFP_ATOMIC ) ) )
return - ENOMEM ;
if ( ! static_key_fast_inc_not_disabled ( & tcp_md5_needed . key . key ) ) {
struct tcp_md5sig_info * md5sig ;
md5sig = rcu_dereference_protected ( tp - > md5sig_info , lockdep_sock_is_held ( sk ) ) ;
net_warn_ratelimited ( " Too many TCP-MD5 keys in the system \n " ) ;
rcu_assign_pointer ( tp - > md5sig_info , NULL ) ;
2022-12-02 05:28:47 +00:00
kfree_rcu ( md5sig , rcu ) ;
2022-11-23 17:38:57 +00:00
return - EUSERS ;
}
}
return __tcp_md5_do_add ( sk , addr , family , prefixlen , l3index ,
key - > flags , key - > key , key - > keylen ,
sk_gfp_mask ( sk , GFP_ATOMIC ) ) ;
}
EXPORT_SYMBOL ( tcp_md5_key_copy ) ;
2017-06-15 18:07:06 -07:00
int tcp_md5_do_del ( struct sock * sk , const union tcp_md5_addr * addr , int family ,
2021-10-15 10:26:05 +03:00
u8 prefixlen , int l3index , u8 flags )
2006-11-14 19:07:45 -08:00
{
2012-01-31 05:18:33 +00:00
struct tcp_md5sig_key * key ;
2021-10-15 10:26:05 +03:00
key = tcp_md5_do_lookup_exact ( sk , addr , family , prefixlen , l3index , flags ) ;
2012-01-31 05:18:33 +00:00
if ( ! key )
return - ENOENT ;
hlist_del_rcu ( & key - > node ) ;
2012-01-31 10:56:48 +00:00
atomic_sub ( sizeof ( * key ) , & sk - > sk_omem_alloc ) ;
2012-01-31 05:18:33 +00:00
kfree_rcu ( key , rcu ) ;
return 0 ;
2006-11-14 19:07:45 -08:00
}
2012-01-31 05:18:33 +00:00
EXPORT_SYMBOL ( tcp_md5_do_del ) ;
2006-11-14 19:07:45 -08:00
2012-10-26 14:31:40 +00:00
static void tcp_clear_md5_list ( struct sock * sk )
2006-11-14 19:07:45 -08:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
2012-01-31 05:18:33 +00:00
struct tcp_md5sig_key * key ;
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 17:06:00 -08:00
struct hlist_node * n ;
2012-01-31 18:45:40 +00:00
struct tcp_md5sig_info * md5sig ;
2006-11-14 19:07:45 -08:00
2012-01-31 18:45:40 +00:00
md5sig = rcu_dereference_protected ( tp - > md5sig_info , 1 ) ;
2013-02-27 17:06:00 -08:00
hlist_for_each_entry_safe ( key , n , & md5sig - > head , node ) {
2012-01-31 05:18:33 +00:00
hlist_del_rcu ( & key - > node ) ;
2012-01-31 10:56:48 +00:00
atomic_sub ( sizeof ( * key ) , & sk - > sk_omem_alloc ) ;
2012-01-31 05:18:33 +00:00
kfree_rcu ( key , rcu ) ;
2006-11-14 19:07:45 -08:00
}
}
2017-06-15 18:07:07 -07:00
static int tcp_v4_parse_md5_keys ( struct sock * sk , int optname ,
2020-07-23 08:09:05 +02:00
sockptr_t optval , int optlen )
2006-11-14 19:07:45 -08:00
{
struct tcp_md5sig cmd ;
struct sockaddr_in * sin = ( struct sockaddr_in * ) & cmd . tcpm_addr ;
2019-12-30 14:14:25 -08:00
const union tcp_md5_addr * addr ;
2017-06-15 18:07:07 -07:00
u8 prefixlen = 32 ;
2019-12-30 14:14:28 -08:00
int l3index = 0 ;
2021-10-15 10:26:05 +03:00
u8 flags ;
2006-11-14 19:07:45 -08:00
if ( optlen < sizeof ( cmd ) )
return - EINVAL ;
2020-07-23 08:09:05 +02:00
if ( copy_from_sockptr ( & cmd , optval , sizeof ( cmd ) ) )
2006-11-14 19:07:45 -08:00
return - EFAULT ;
if ( sin - > sin_family ! = AF_INET )
return - EINVAL ;
2021-10-15 10:26:05 +03:00
flags = cmd . tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX ;
2017-06-15 18:07:07 -07:00
if ( optname = = TCP_MD5SIG_EXT & &
cmd . tcpm_flags & TCP_MD5SIG_FLAG_PREFIX ) {
prefixlen = cmd . tcpm_prefixlen ;
if ( prefixlen > 32 )
return - EINVAL ;
}
2021-10-15 10:26:05 +03:00
if ( optname = = TCP_MD5SIG_EXT & & cmd . tcpm_ifindex & &
2019-12-30 14:14:29 -08:00
cmd . tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX ) {
struct net_device * dev ;
rcu_read_lock ( ) ;
dev = dev_get_by_index_rcu ( sock_net ( sk ) , cmd . tcpm_ifindex ) ;
if ( dev & & netif_is_l3_master ( dev ) )
l3index = dev - > ifindex ;
rcu_read_unlock ( ) ;
/* ok to reference set/not set outside of rcu;
* right now device MUST be an L3 master
*/
if ( ! dev | | ! l3index )
return - EINVAL ;
}
2019-12-30 14:14:25 -08:00
addr = ( union tcp_md5_addr * ) & sin - > sin_addr . s_addr ;
2014-08-03 22:45:19 +04:00
if ( ! cmd . tcpm_keylen )
2021-10-15 10:26:05 +03:00
return tcp_md5_do_del ( sk , addr , AF_INET , prefixlen , l3index , flags ) ;
2006-11-14 19:07:45 -08:00
if ( cmd . tcpm_keylen > TCP_MD5SIG_MAXKEYLEN )
return - EINVAL ;
2021-10-15 10:26:05 +03:00
return tcp_md5_do_add ( sk , addr , AF_INET , prefixlen , l3index , flags ,
2022-11-23 17:38:57 +00:00
cmd . tcpm_key , cmd . tcpm_keylen ) ;
2006-11-14 19:07:45 -08:00
}
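/*
 * Illustration only: a userspace sketch of the request that the parser
 * above consumes. It fills struct tcp_md5sig from <linux/tcp.h> for an
 * IPv4 peer prefix and applies the same two range checks (prefix length
 * at most 32, key length at most TCP_MD5SIG_MAXKEYLEN) before the kernel
 * would. demo_fill_md5sig and demo_set_md5_key are hypothetical helper
 * names; it is assumed that the installed kernel headers provide
 * TCP_MD5SIG_EXT and the tcpm_prefixlen field.
 */

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tcp.h>	/* struct tcp_md5sig, TCP_MD5SIG_EXT */

/* Build a TCP_MD5SIG_EXT request for an IPv4 peer prefix. */
static int demo_fill_md5sig(struct tcp_md5sig *cmd, const char *peer,
			    uint8_t prefixlen, const void *key, size_t keylen)
{
	struct sockaddr_in *sin;

	assert(cmd != NULL && peer != NULL && key != NULL);
	if (prefixlen > 32 || keylen > TCP_MD5SIG_MAXKEYLEN)
		return -EINVAL;

	memset(cmd, 0, sizeof(*cmd));
	sin = (struct sockaddr_in *)&cmd->tcpm_addr;
	sin->sin_family = AF_INET;
	if (inet_pton(AF_INET, peer, &sin->sin_addr) != 1)
		return -EINVAL;

	cmd->tcpm_flags = TCP_MD5SIG_FLAG_PREFIX;
	cmd->tcpm_prefixlen = prefixlen;
	cmd->tcpm_keylen = (uint16_t)keylen;
	memcpy(cmd->tcpm_key, key, keylen);
	return 0;
}

/* Hand the request to the kernel; returns setsockopt()'s result. */
static int demo_set_md5_key(int fd, const struct tcp_md5sig *cmd)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_MD5SIG_EXT, cmd, sizeof(*cmd));
}
```

/*
 * A zero tcpm_keylen takes the delete path in the parser above, so in
 * this sketch passing keylen 0 would request removal of the matching key
 * instead of an add.
 */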
2016-06-27 18:51:53 +02:00
static int tcp_v4_md5_hash_headers ( struct tcp_md5sig_pool * hp ,
__be32 daddr , __be32 saddr ,
const struct tcphdr * th , int nbytes )
2006-11-14 19:07:45 -08:00
{
struct tcp4_pseudohdr * bp ;
2008-07-19 00:01:42 -07:00
struct scatterlist sg ;
2016-06-27 18:51:53 +02:00
struct tcphdr * _th ;
2006-11-14 19:07:45 -08:00
2016-06-27 18:51:53 +02:00
bp = hp - > scratch ;
2006-11-14 19:07:45 -08:00
bp - > saddr = saddr ;
bp - > daddr = daddr ;
bp - > pad = 0 ;
2008-04-17 12:48:12 +09:00
bp - > protocol = IPPROTO_TCP ;
2008-07-19 00:01:42 -07:00
bp - > len = cpu_to_be16 ( nbytes ) ;
2007-10-26 00:41:21 -07:00
2016-06-27 18:51:53 +02:00
_th = ( struct tcphdr * ) ( bp + 1 ) ;
memcpy ( _th , th , sizeof ( * th ) ) ;
_th - > check = 0 ;
sg_init_one ( & sg , bp , sizeof ( * bp ) + sizeof ( * th ) ) ;
ahash_request_set_crypt ( hp - > md5_req , & sg , NULL ,
sizeof ( * bp ) + sizeof ( * th ) ) ;
2016-01-24 21:20:23 +08:00
return crypto_ahash_update ( hp - > md5_req ) ;
2008-07-19 00:01:42 -07:00
}
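/*
 * Illustration only, not kernel code: a userspace sketch of the 12-byte
 * IPv4 pseudo-header that the function above places in front of the TCP
 * header before hashing. The field order follows the assignments above
 * (saddr, daddr, pad, protocol, len); demo_pseudohdr and
 * demo_fill_pseudohdr are hypothetical names. Addresses are assumed to be
 * passed already in network byte order.
 */

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>	/* IPPROTO_TCP */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the pseudo-header layout used for the hash. */
struct demo_pseudohdr {
	uint32_t saddr;		/* network byte order */
	uint32_t daddr;		/* network byte order */
	uint8_t pad;
	uint8_t protocol;
	uint16_t len;		/* network byte order */
};

static void demo_fill_pseudohdr(struct demo_pseudohdr *bp, uint32_t saddr_be,
				uint32_t daddr_be, uint16_t nbytes)
{
	assert(bp != NULL);
	bp->saddr = saddr_be;
	bp->daddr = daddr_be;
	bp->pad = 0;
	bp->protocol = IPPROTO_TCP;
	bp->len = htons(nbytes);	/* TCP segment length in bytes */
}
```

/*
 * The length is stored big-endian, so a 20-byte segment appears as the
 * bytes 0x00 0x14 at offsets 10 and 11 of the structure.
 */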
2012-01-31 05:18:33 +00:00
static int tcp_v4_md5_hash_hdr ( char * md5_hash , const struct tcp_md5sig_key * key ,
2011-10-24 02:46:04 -04:00
__be32 daddr , __be32 saddr , const struct tcphdr * th )
2008-07-19 00:01:42 -07:00
{
struct tcp_md5sig_pool * hp ;
2016-01-24 21:20:23 +08:00
struct ahash_request * req ;
2008-07-19 00:01:42 -07:00
hp = tcp_get_md5sig_pool ( ) ;
if ( ! hp )
goto clear_hash_noput ;
2016-01-24 21:20:23 +08:00
req = hp - > md5_req ;
2008-07-19 00:01:42 -07:00
2016-01-24 21:20:23 +08:00
if ( crypto_ahash_init ( req ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
2016-06-27 18:51:53 +02:00
if ( tcp_v4_md5_hash_headers ( hp , daddr , saddr , th , th - > doff < < 2 ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
if ( tcp_md5_hash_key ( hp , key ) )
goto clear_hash ;
2016-01-24 21:20:23 +08:00
ahash_request_set_crypt ( req , NULL , md5_hash , 0 ) ;
if ( crypto_ahash_final ( req ) )
2006-11-14 19:07:45 -08:00
goto clear_hash ;
tcp_put_md5sig_pool ( ) ;
return 0 ;
2008-07-19 00:01:42 -07:00
2006-11-14 19:07:45 -08:00
clear_hash :
tcp_put_md5sig_pool ( ) ;
clear_hash_noput :
memset ( md5_hash , 0 , 16 ) ;
2008-07-19 00:01:42 -07:00
return 1 ;
2006-11-14 19:07:45 -08:00
}
2015-03-24 15:58:55 -07:00
int tcp_v4_md5_hash_skb ( char * md5_hash , const struct tcp_md5sig_key * key ,
const struct sock * sk ,
2011-10-24 02:46:04 -04:00
const struct sk_buff * skb )
2006-11-14 19:07:45 -08:00
{
2008-07-19 00:01:42 -07:00
struct tcp_md5sig_pool * hp ;
2016-01-24 21:20:23 +08:00
struct ahash_request * req ;
2011-10-24 02:46:04 -04:00
const struct tcphdr * th = tcp_hdr ( skb ) ;
2006-11-14 19:07:45 -08:00
__be32 saddr , daddr ;
2015-03-24 15:58:55 -07:00
if ( sk ) { /* valid for established/request sockets */
saddr = sk - > sk_rcv_saddr ;
daddr = sk - > sk_daddr ;
2006-11-14 19:07:45 -08:00
} else {
2008-07-19 00:01:42 -07:00
const struct iphdr * iph = ip_hdr ( skb ) ;
saddr = iph - > saddr ;
daddr = iph - > daddr ;
2006-11-14 19:07:45 -08:00
}
2008-07-19 00:01:42 -07:00
hp = tcp_get_md5sig_pool ( ) ;
if ( ! hp )
goto clear_hash_noput ;
2016-01-24 21:20:23 +08:00
req = hp - > md5_req ;
2008-07-19 00:01:42 -07:00
2016-01-24 21:20:23 +08:00
if ( crypto_ahash_init ( req ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
2016-06-27 18:51:53 +02:00
if ( tcp_v4_md5_hash_headers ( hp , daddr , saddr , th , skb - > len ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
if ( tcp_md5_hash_skb_data ( hp , skb , th - > doff < < 2 ) )
goto clear_hash ;
if ( tcp_md5_hash_key ( hp , key ) )
goto clear_hash ;
2016-01-24 21:20:23 +08:00
ahash_request_set_crypt ( req , NULL , md5_hash , 0 ) ;
if ( crypto_ahash_final ( req ) )
2008-07-19 00:01:42 -07:00
goto clear_hash ;
tcp_put_md5sig_pool ( ) ;
return 0 ;
clear_hash :
tcp_put_md5sig_pool ( ) ;
clear_hash_noput :
memset ( md5_hash , 0 , 16 ) ;
return 1 ;
2006-11-14 19:07:45 -08:00
}
2008-07-19 00:01:42 -07:00
EXPORT_SYMBOL ( tcp_v4_md5_hash_skb ) ;
2006-11-14 19:07:45 -08:00
2015-10-02 11:43:28 -07:00
# endif
2015-09-25 07:39:08 -07:00
static void tcp_v4_init_req ( struct request_sock * req ,
const struct sock * sk_listener ,
2014-06-25 17:09:53 +03:00
struct sk_buff * skb )
{
struct inet_request_sock * ireq = inet_rsk ( req ) ;
2017-10-20 09:04:13 -07:00
struct net * net = sock_net ( sk_listener ) ;
2014-06-25 17:09:53 +03:00
2015-03-18 14:05:38 -07:00
sk_rcv_saddr_set ( req_to_sk ( req ) , ip_hdr ( skb ) - > daddr ) ;
sk_daddr_set ( req_to_sk ( req ) , ip_hdr ( skb ) - > saddr ) ;
2017-10-20 09:04:13 -07:00
RCU_INIT_POINTER ( ireq - > ireq_opt , tcp_v4_save_options ( net , skb ) ) ;
2014-06-25 17:09:53 +03:00
}
2015-09-29 07:42:50 -07:00
static struct dst_entry * tcp_v4_route_req ( const struct sock * sk ,
2020-11-30 16:36:30 +01:00
struct sk_buff * skb ,
2015-09-29 07:42:50 -07:00
struct flowi * fl ,
2020-11-30 16:36:30 +01:00
struct request_sock * req )
2014-06-25 17:09:55 +03:00
{
2020-11-30 16:36:30 +01:00
tcp_v4_init_req ( req , sk , skb ) ;
if ( security_inet_conn_request ( sk , skb , req ) )
return NULL ;
2017-03-15 16:30:46 -04:00
return inet_csk_route_req ( sk , & fl - > u . ip4 , req ) ;
2014-06-25 17:09:55 +03:00
}
2006-11-16 02:30:37 -08:00
struct request_sock_ops tcp_request_sock_ops __read_mostly = {
2005-04-16 15:20:36 -07:00
. family = PF_INET ,
[NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:46:52 -07:00
. obj_size = sizeof ( struct tcp_request_sock ) ,
2014-06-25 17:09:59 +03:00
. rtx_syn_ack = tcp_rtx_synack ,
2005-06-18 22:47:21 -07:00
. send_ack = tcp_v4_reqsk_send_ack ,
. destructor = tcp_v4_reqsk_destructor ,
2005-04-16 15:20:36 -07:00
. send_reset = tcp_v4_send_reset ,
2014-08-29 23:32:05 -07:00
. syn_ack_timeout = tcp_syn_ack_timeout ,
2005-04-16 15:20:36 -07:00
} ;
2020-01-09 07:59:21 -08:00
const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
2014-06-25 17:10:00 +03:00
. mss_clamp = TCP_MSS_DEFAULT ,
2014-06-25 17:09:53 +03:00
# ifdef CONFIG_TCP_MD5SIG
2015-03-24 15:58:56 -07:00
. req_md5_lookup = tcp_v4_md5_lookup ,
2009-07-16 05:04:51 +00:00
. calc_md5_hash = tcp_v4_md5_hash_skb ,
2006-11-30 19:16:28 -08:00
# endif
2014-06-25 17:09:54 +03:00
# ifdef CONFIG_SYN_COOKIES
. cookie_init_seq = cookie_v4_init_sequence ,
# endif
2014-06-25 17:09:55 +03:00
. route_req = tcp_v4_route_req ,
2017-05-05 06:56:54 -07:00
. init_seq = tcp_v4_init_seq ,
. init_ts_off = tcp_v4_init_ts_off ,
2014-06-25 17:09:58 +03:00
. send_synack = tcp_v4_send_synack ,
2014-06-25 17:09:53 +03:00
} ;
2006-11-14 19:07:45 -08:00
2005-04-16 15:20:36 -07:00
int tcp_v4_conn_request ( struct sock * sk , struct sk_buff * skb )
{
/* Never answer SYNs sent to broadcast or multicast */
2009-06-02 05:14:27 +00:00
if ( skb_rtable ( skb ) - > rt_flags & ( RTCF_BROADCAST | RTCF_MULTICAST ) )
2005-04-16 15:20:36 -07:00
goto drop ;
2014-06-25 17:10:02 +03:00
return tcp_conn_request ( & tcp_request_sock_ops ,
& tcp_request_sock_ipv4_ops , sk , skb ) ;
2005-04-16 15:20:36 -07:00
drop :
2016-04-01 08:52:20 -07:00
tcp_listendrop ( sk ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
2010-07-09 21:22:10 +00:00
EXPORT_SYMBOL ( tcp_v4_conn_request ) ;
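/*
 * Illustration only, not kernel code: a userspace sketch of the
 * early-drop check above. A SYN whose route is flagged broadcast or
 * multicast is counted as a listen drop and never queued. The
 * DEMO_RTCF_* values, struct demo_listener and demo_conn_request are
 * placeholders for this demo. One simplification: the accepted path here
 * just bumps a counter and returns 0, whereas the function above returns
 * whatever tcp_conn_request() returns.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Demo-local route flag bits (placeholders for RTCF_BROADCAST/MULTICAST). */
#define DEMO_RTCF_BROADCAST 0x10000000u
#define DEMO_RTCF_MULTICAST 0x20000000u

/* Hypothetical listener bookkeeping. */
struct demo_listener {
	unsigned int drops;	/* SYNs refused before any request was built */
	unsigned int queued;	/* SYNs handed on for normal processing */
};

static int demo_conn_request(struct demo_listener *l, uint32_t rt_flags)
{
	assert(l != NULL);
	if (rt_flags & (DEMO_RTCF_BROADCAST | DEMO_RTCF_MULTICAST)) {
		l->drops++;	/* mirrors the drop label above */
		return 0;
	}
	l->queued++;
	return 0;
}
```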
2005-04-16 15:20:36 -07:00
/*
* The three way handshake has completed - we got a valid ACK -
* now create the new socket .
*/
2015-09-29 07:42:48 -07:00
struct sock * tcp_v4_syn_recv_sock ( const struct sock * sk , struct sk_buff * skb ,
2005-06-18 22:47:21 -07:00
struct request_sock * req ,
2015-10-22 08:20:46 -07:00
struct dst_entry * dst ,
struct request_sock * req_unhash ,
bool * own_req )
2005-04-16 15:20:36 -07:00
{
2005-06-18 22:46:52 -07:00
struct inet_request_sock * ireq ;
tcp: fix race condition when creating child sockets from syncookies
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.
The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But it is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.
Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.
When such event happens, the TCP stack code has a race condition that
occurs between the moment a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.
This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 11:11:33 +00:00
bool found_dup_sk = false ;
2005-04-16 15:20:36 -07:00
struct inet_sock * newinet ;
struct tcp_sock * newtp ;
struct sock * newsk ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2019-12-30 14:14:25 -08:00
const union tcp_md5_addr * addr ;
2006-11-14 19:07:45 -08:00
struct tcp_md5sig_key * key ;
2019-12-30 14:14:28 -08:00
int l3index ;
2006-11-14 19:07:45 -08:00
# endif
2011-04-21 09:45:37 +00:00
struct ip_options_rcu * inet_opt ;
2005-04-16 15:20:36 -07:00
if ( sk_acceptq_is_full ( sk ) )
goto exit_overflow ;
newsk = tcp_create_openreq_child ( sk , req , skb ) ;
if ( ! newsk )
2010-10-21 13:06:43 +02:00
goto exit_nonewsk ;
2005-04-16 15:20:36 -07:00
2006-06-30 13:36:35 -07:00
newsk - > sk_gso_type = SKB_GSO_TCPV4 ;
2012-08-19 03:30:38 +00:00
inet_sk_rx_dst_set ( newsk , skb ) ;
2005-04-16 15:20:36 -07:00
newtp = tcp_sk ( newsk ) ;
newinet = inet_sk ( newsk ) ;
2005-06-18 22:46:52 -07:00
ireq = inet_rsk ( req ) ;
2015-03-18 14:05:35 -07:00
sk_daddr_set ( newsk , ireq - > ir_rmt_addr ) ;
sk_rcv_saddr_set ( newsk , ireq - > ir_loc_addr ) ;
2015-12-16 13:20:44 -08:00
newsk - > sk_bound_dev_if = ireq - > ir_iif ;
2017-10-20 09:04:13 -07:00
newinet - > inet_saddr = ireq - > ir_loc_addr ;
inet_opt = rcu_dereference ( ireq - > ireq_opt ) ;
RCU_INIT_POINTER ( newinet - > inet_opt , inet_opt ) ;
2005-08-09 20:10:42 -07:00
newinet - > mc_index = inet_iif ( skb ) ;
2007-04-20 22:47:35 -07:00
newinet - > mc_ttl = ip_hdr ( skb ) - > ttl ;
2012-02-09 09:35:49 +00:00
newinet - > rcv_tos = ip_hdr ( skb ) - > tos ;
2005-12-13 23:26:10 -08:00
inet_csk ( newsk ) - > icsk_ext_hdr_len = 0 ;
2011-04-21 09:45:37 +00:00
if ( inet_opt )
inet_csk ( newsk ) - > icsk_ext_hdr_len = inet_opt - > opt . optlen ;
2022-10-05 17:23:53 +02:00
newinet - > inet_id = get_random_u16 ( ) ;
2005-04-16 15:20:36 -07:00
2020-12-08 09:55:08 -08:00
/* Set ToS of the new socket based upon the value of incoming SYN.
* ECT bits are set later in tcp_init_transfer ( ) .
*/
2022-07-22 11:22:04 -07:00
if ( READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_tcp_reflect_tos ) )
2020-09-09 17:50:48 -07:00
newinet - > tos = tcp_rsk ( req ) - > syn_tos & ~ INET_ECN_MASK ;
2012-03-10 09:20:21 +00:00
if ( ! dst ) {
dst = inet_csk_route_child_sock ( sk , newsk , req ) ;
if ( ! dst )
goto put_and_exit ;
} else {
/* syncookie case : see end of cookie_v4_check() */
}
2011-05-08 15:28:03 -07:00
sk_setup_caps ( newsk , dst ) ;
net: tcp: add per route congestion control
This work adds the possibility to define a per route/destination
congestion control algorithm. Generally, this opens up the possibility
for a machine with different links to enforce specific congestion
control algorithms with optimal strategies for each of them based
on their network characteristics, even transparently for a single
application listening on all links.
For our specific use case, this additionally facilitates deployment
of DCTCP, for example, applications can easily serve internal
traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
would also allow for utilizing e.g. long living, low priority
background flows for certain destinations/routes while still being
able for normal traffic to utilize the default congestion control
algorithm. We also thought about a per netns setting (where different
defaults are possible), but given it's actually a link specific
property, we argue that a per route/destination setting is the most
natural and flexible.
The administrator can utilize this through ip-route(8) by appending
"congctl [lock] <name>", where <name> denotes the name of a
congestion control algorithm and the optional lock parameter allows
to enforce the given algorithm so that applications in user space
would not be allowed to overwrite that algorithm for that destination.
The dst metric lookups are being done when a dst entry is already
available in order to avoid a costly lookup and still before the
algorithms are being initialized, thus overhead is very low when the
feature is not being used. While the client side would need to drop
the current reference on the module, on server side this can actually
even be avoided as we just got a flat-copied socket clone.
Joint work with Florian Westphal.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
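The selection rule described in this changelog (a route may name a congestion control algorithm, and the optional lock keeps user space from overriding it) can be sketched as follows. This is a user-space model under stated assumptions: the struct and function names are hypothetical, and the kernel keeps this information as a dst metric rather than as strings on a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the per route setting. */
struct route_ca_model {
	const char *ca_name;	/* NULL when the route names no algorithm */
	bool locked;		/* models the optional "lock" parameter */
};

/* Hypothetical model of the per socket state. */
struct sock_ca_model {
	const char *ca_name;
	bool locked;
};

/* Pick the route's algorithm if it has one, else the default. */
static void sock_ca_init_from_route(struct sock_ca_model *sk,
				    const struct route_ca_model *rt,
				    const char *default_ca)
{
	assert(sk != NULL && default_ca != NULL);
	if (rt && rt->ca_name) {
		sk->ca_name = rt->ca_name;
		sk->locked = rt->locked;
	} else {
		sk->ca_name = default_ca;
		sk->locked = false;
	}
}

/* An application request is refused when the route locked the choice. */
static bool sock_ca_set_from_user(struct sock_ca_model *sk, const char *name)
{
	assert(sk != NULL && name != NULL);
	if (sk->locked)
		return false;
	sk->ca_name = name;
	return true;
}

/* Convenience predicate for checking the current choice. */
static bool sock_ca_is(const struct sock_ca_model *sk, const char *name)
{
	return strcmp(sk->ca_name, name) == 0;
}
```

With a route configured as locked to "dctcp", a later request from the application for another algorithm is rejected; without a route setting, the default applies and the application may still change it.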
2015-01-05 23:57:48 +01:00
tcp_ca_openreq_child ( newsk , dst ) ;
2005-04-16 15:20:36 -07:00
tcp_sync_mss ( newsk , dst_mtu ( dst ) ) ;
2017-02-02 08:04:56 -08:00
newtp - > advmss = tcp_mss_clamp ( tcp_sk ( sk ) , dst_metric_advmss ( dst ) ) ;
2008-09-21 00:21:51 -07:00
2005-04-16 15:20:36 -07:00
tcp_initialize_rcv_mss ( newsk ) ;
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2019-12-30 14:14:28 -08:00
l3index = l3mdev_master_ifindex_by_index ( sock_net ( sk ) , ireq - > ir_iif ) ;
2006-11-14 19:07:45 -08:00
/* Copy over the MD5 key from the original socket */
2019-12-30 14:14:25 -08:00
addr = ( union tcp_md5_addr * ) & newinet - > inet_daddr ;
2019-12-30 14:14:28 -08:00
key = tcp_md5_do_lookup ( sk , l3index , addr , AF_INET ) ;
2015-04-03 09:17:27 +01:00
if ( key ) {
2022-11-23 17:38:58 +00:00
if ( tcp_md5_key_copy ( newsk , addr , AF_INET , 32 , l3index , key ) )
goto put_and_exit ;
2021-11-15 11:02:35 -08:00
sk_gso_disable ( newsk ) ;
2006-11-14 19:07:45 -08:00
}
# endif
2011-05-08 15:28:03 -07:00
if ( __inet_inherit_port ( sk , newsk ) < 0 )
goto put_and_exit ;
tcp: fix race condition when creating child sockets from syncookies
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.
The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But it is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.
Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.
When such an event happens, the TCP stack code has a race condition that
occurs between the moment a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program for the same client.
This patch fixes the race condition by checking whether a child
socket already exists for the same client when we are adding the second
child socket to the established connections hashtable. If one
exists, we drop the packet and discard the second child socket
for the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
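The fix described in this changelog amounts to making the duplicate check and the insertion a single step. Below is a single-threaded user-space sketch of that idea. All names are hypothetical, the hash function is a placeholder, and the bucket lock that the kernel would hold across the lookup and the insert is assumed and omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONN_BUCKETS 16

/* Hypothetical 4-tuple identifying a connection. */
struct conn_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

struct conn_model {
	struct conn_key key;
	struct conn_model *next;
};

struct conn_table {
	struct conn_model *bucket[CONN_BUCKETS];
};

static bool conn_key_equal(const struct conn_key *a, const struct conn_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/* Placeholder hash; only needs to be deterministic for this sketch. */
static unsigned int conn_key_hash(const struct conn_key *k)
{
	uint32_t h = k->saddr ^ k->daddr ^ ((uint32_t)k->sport << 16) ^ k->dport;

	return h % CONN_BUCKETS;
}

/*
 * Insert c unless an entry with the same key is already present.
 * Returns true if c was inserted. On false, *found_dup tells the
 * caller that it lost the race and must discard its own socket.
 */
static bool conn_insert_unique(struct conn_table *t, struct conn_model *c,
			       bool *found_dup)
{
	struct conn_model **head = &t->bucket[conn_key_hash(&c->key)];
	struct conn_model *it;

	assert(found_dup != NULL);
	*found_dup = false;
	for (it = *head; it; it = it->next) {
		if (conn_key_equal(&it->key, &c->key)) {
			*found_dup = true;
			return false;
		}
	}
	c->next = *head;
	*head = c;
	return true;
}
```

The second caller for the same 4-tuple gets false with *found_dup set, which corresponds to the case where the packet is dropped and the second child socket is discarded.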
2020-11-20 11:11:33 +00:00
* own_req = inet_ehash_nolisten ( newsk , req_to_sk ( req_unhash ) ,
& found_dup_sk ) ;
2017-10-20 09:04:13 -07:00
if ( likely ( * own_req ) ) {
2015-11-05 12:50:19 -08:00
tcp_move_syn ( newtp , req ) ;
2017-10-20 09:04:13 -07:00
ireq - > ireq_opt = NULL ;
} else {
tcp: Fix potential use-after-free due to double kfree()
On receiving an ACK with a valid SYN cookie, cookie_v4_check() allocates struct
request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
socket into ehash and sets ireq_opt to NULL. Otherwise,
tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full
socket.
The commit 01770a1661657 ("tcp: fix race condition when creating child
sockets from syncookies") added a new path, in which more than one core
can create a full socket for the same SYN cookie. Currently, the core which
loses the race frees the full socket without resetting inet_opt, so
both sock_put() and reqsk_put() call kfree() for the same memory:
sock_put
  sk_free
    __sk_free
      sk_destruct
        __sk_destruct
          sk->sk_destruct/inet_sock_destruct
            kfree(rcu_dereference_protected(inet->inet_opt, 1));
reqsk_put
  reqsk_free
    __reqsk_free
      req->rsk_ops->destructor/tcp_v4_reqsk_destructor
        kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
Calling kmalloc() between the double kfree() can lead to use-after-free, so
this patch fixes it by setting inet_opt to NULL before sock_put().
As a side note, this kind of issue does not happen for IPv6. This is
because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
correspond to ireq_opt in IPv4.
Fixes: 01770a166165 ("tcp: fix race condition when creating child sockets from syncookies")
CC: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
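The ownership rule behind this fix can be sketched in user space: two objects briefly alias the same buffer, and exactly one of them must give up its pointer before both are destroyed. The names below are hypothetical and malloc()/free() stand in for kmalloc()/kfree(); this is an illustration of the rule, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Counts how many times the shared buffer is really freed. */
static int opt_free_calls;

static void opt_free(void *p)
{
	if (p) {
		opt_free_calls++;
		free(p);
	}
}

/* Hypothetical models of the request sock and the full (child) socket. */
struct req_opt_model   { void *opt; };
struct child_opt_model { void *opt; };

/*
 * Allocate one buffer, alias it from both objects, hand ownership to
 * exactly one side, then destroy both. Returns the number of frees,
 * which must be 1 in either case.
 */
static int opt_cleanup_frees(bool child_won)
{
	struct req_opt_model req = { .opt = malloc(16) };
	struct child_opt_model child = { .opt = req.opt };

	assert(req.opt != NULL);
	opt_free_calls = 0;

	if (child_won)
		req.opt = NULL;		/* like ireq->ireq_opt = NULL */
	else
		child.opt = NULL;	/* like newinet->inet_opt = NULL */

	opt_free(child.opt);		/* child destructor */
	opt_free(req.opt);		/* request destructor */
	return opt_free_calls;
}
```

Leaving out either assignment would make both destructors free the same pointer, which is the double kfree() described above.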
2021-01-18 14:59:20 +09:00
newinet - > inet_opt = NULL ;
2020-11-20 11:11:33 +00:00
if ( ! req_unhash & & found_dup_sk ) {
/* This code path should be executed in the
* syncookie case only
*/
bh_unlock_sock ( newsk ) ;
sock_put ( newsk ) ;
newsk = NULL ;
}
2017-10-20 09:04:13 -07:00
}
2005-04-16 15:20:36 -07:00
return newsk ;
exit_overflow :
2016-04-29 14:16:47 -07:00
NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_LISTENOVERFLOWS ) ;
2010-10-21 13:06:43 +02:00
exit_nonewsk :
dst_release ( dst ) ;
2005-04-16 15:20:36 -07:00
exit :
2016-04-01 08:52:20 -07:00
tcp_listendrop ( sk ) ;
2005-04-16 15:20:36 -07:00
return NULL ;
2011-05-08 15:28:03 -07:00
put_and_exit :
2017-10-20 09:04:13 -07:00
newinet - > inet_opt = NULL ;
inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock
If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:
unreferenced object 0xffff88022e8a92c0 (size 1592):
comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
hex dump (first 32 bytes):
0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
[<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
[<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
[<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
[<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
[<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
[<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
[<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
[<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
[<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
[<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
[<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
[<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
[<ffffffff814cee68>] ip_rcv+0x217/0x267
[<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
[<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
This happens because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.
This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,...
Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it to enter inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.
Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
A similar approach is taken for dccp by calling dccp_done().
The bug has been in the kernel since 093d282321 (tproxy: fix hash locking issue
when using port redirection in __inet_inherit_port()), i.e. since
version 2.6.37.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
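The leak described in this changelog follows directly from the reference count starting at 2. The sketch below models only that arithmetic. The struct and helpers are hypothetical, and the real error path also has to undo xfrm, memcg and similar state, which is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a freshly cloned socket. */
struct sock_ref_model {
	int refcnt;
	bool freed;
};

/* The clone starts with two references, as the changelog states. */
static void sock_ref_clone_init(struct sock_ref_model *sk)
{
	sk->refcnt = 2;
	sk->freed = false;
}

/* Drop one reference; the object is freed when the count hits zero. */
static void sock_ref_put(struct sock_ref_model *sk)
{
	assert(sk->refcnt > 0);
	if (--sk->refcnt == 0)
		sk->freed = true;
}

/* Error path sketch: both references have to be dropped. */
static void sock_ref_forced_close(struct sock_ref_model *sk)
{
	sock_ref_put(sk);	/* the caller's reference from the clone */
	sock_ref_put(sk);	/* the final reference, dropped during teardown */
}
```

A single sock_ref_put() after sock_ref_clone_init() leaves refcnt at 1 and freed at false, which is the leak that kmemleak reported; sock_ref_forced_close() brings the count to zero.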
2012-12-14 04:07:58 +00:00
inet_csk_prepare_forced_close ( newsk ) ;
tcp_done ( newsk ) ;
2011-05-08 15:28:03 -07:00
goto exit ;
2005-04-16 15:20:36 -07:00
}
2010-07-09 21:22:10 +00:00
EXPORT_SYMBOL ( tcp_v4_syn_recv_sock ) ;
2005-04-16 15:20:36 -07:00
2015-10-02 11:43:32 -07:00
static struct sock * tcp_v4_cookie_check ( struct sock * sk , struct sk_buff * skb )
2005-04-16 15:20:36 -07:00
{
2015-10-02 11:43:32 -07:00
# ifdef CONFIG_SYN_COOKIES
2015-03-19 19:04:19 -07:00
const struct tcphdr * th = tcp_hdr ( skb ) ;
2005-04-16 15:20:36 -07:00
2010-06-03 00:43:44 +00:00
if ( ! th - > syn )
2014-10-15 14:33:22 -07:00
sk = cookie_v4_check ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
# endif
return sk ;
}
2019-07-29 09:59:14 -07:00
u16 tcp_v4_get_syncookie ( struct sock * sk , struct iphdr * iph ,
struct tcphdr * th , u32 * cookie )
{
u16 mss = 0 ;
# ifdef CONFIG_SYN_COOKIES
mss = tcp_get_syncookie_mss ( & tcp_request_sock_ops ,
& tcp_request_sock_ipv4_ops , sk , th ) ;
if ( mss ) {
* cookie = __cookie_v4_init_sequence ( iph , th , & mss ) ;
tcp_synq_overflow ( sk ) ;
}
# endif
return mss ;
}
2021-02-01 17:41:32 +00:00
INDIRECT_CALLABLE_DECLARE ( struct dst_entry * ipv4_dst_check ( struct dst_entry * ,
u32 ) ) ;
2005-04-16 15:20:36 -07:00
/* The socket must have its spinlock held when we get
2015-10-02 11:43:39 -07:00
* here , unless it is a TCP_LISTEN socket .
2005-04-16 15:20:36 -07:00
*
* We have a potential double - lock case here , so even when
* doing backlog processing we use the BH locking scheme .
* This is because we cannot sleep with the original spinlock
* held .
*/
int tcp_v4_do_rcv ( struct sock * sk , struct sk_buff * skb )
{
2022-02-20 15:06:34 +08:00
enum skb_drop_reason reason ;
2006-11-14 19:07:45 -08:00
struct sock * rsk ;
2005-04-16 15:20:36 -07:00
if ( sk - > sk_state = = TCP_ESTABLISHED ) { /* Fast path */
inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And the following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
do not follow standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation on an RCU protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
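The ordering rule stated in this changelog (clear the RCU protected pointer first, release the object afterwards) can be sketched in user space with C11 atomics. This is only a model of the ordering: the names are hypothetical, a plain counter stands in for the dst refcount, and no grace period or deferred freeing is implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical refcounted route entry. */
struct dst_model {
	int refcnt;
};

/* Hypothetical socket holding a published pointer to the entry. */
struct rx_sk_model {
	_Atomic(struct dst_model *) rx_dst;
};

static void dst_model_release(struct dst_model *dst)
{
	assert(dst->refcnt > 0);
	dst->refcnt--;
}

/*
 * Unpublish first, release second. Readers that load rx_dst after the
 * store see NULL and never touch an object whose release has started.
 */
static void rx_sk_model_reset_dst(struct rx_sk_model *sk)
{
	struct dst_model *dst;

	dst = atomic_load_explicit(&sk->rx_dst, memory_order_relaxed);
	if (!dst)
		return;
	atomic_store_explicit(&sk->rx_dst, NULL, memory_order_release);
	dst_model_release(dst);
}
```

Swapping the last two statements reproduces the [a] then [b] sequence that the changelog calls wrong: the object could be released while the pointer is still visible to readers.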
2021-12-20 06:33:30 -08:00
struct dst_entry * dst ;
dst = rcu_dereference_protected ( sk - > sk_rx_dst ,
lockdep_sock_is_held ( sk ) ) ;
2012-07-29 23:20:37 +00:00
2011-08-14 19:45:55 +00:00
sock_rps_save_rxhash ( sk , skb ) ;
2014-11-11 05:54:27 -08:00
sk_mark_napi_id ( sk , skb ) ;
2012-07-29 23:20:37 +00:00
if ( dst ) {
2021-10-25 09:48:16 -07:00
if ( sk - > sk_rx_dst_ifindex ! = skb - > skb_iif | |
2021-02-01 17:41:32 +00:00
! INDIRECT_CALL_1 ( dst - > ops - > check , ipv4_dst_check ,
dst , 0 ) ) {
inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
are not following standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation of RCU protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 06:33:30 -08:00
RCU_INIT_POINTER ( sk - > sk_rx_dst , NULL ) ;
2012-07-23 16:29:00 -07:00
dst_release ( dst ) ;
}
}
2018-05-29 23:27:31 +08:00
tcp_rcv_established ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
2022-02-20 15:06:34 +08:00
reason = SKB_DROP_REASON_NOT_SPECIFIED ;
2015-06-03 23:49:21 -07:00
if ( tcp_checksum_complete ( skb ) )
2005-04-16 15:20:36 -07:00
goto csum_err ;
if ( sk - > sk_state = = TCP_LISTEN ) {
2015-10-02 11:43:32 -07:00
struct sock * nsk = tcp_v4_cookie_check ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
if ( ! nsk )
goto discard ;
if ( nsk ! = sk ) {
2006-11-14 19:07:45 -08:00
if ( tcp_child_process ( sk , nsk , skb ) ) {
rsk = nsk ;
2005-04-16 15:20:36 -07:00
goto reset ;
2006-11-14 19:07:45 -08:00
}
2005-04-16 15:20:36 -07:00
return 0 ;
}
2010-06-03 09:03:58 +00:00
} else
2011-08-14 19:45:55 +00:00
sock_rps_save_rxhash ( sk , skb ) ;
2010-06-03 09:03:58 +00:00
2015-09-29 07:42:41 -07:00
if ( tcp_rcv_state_process ( sk , skb ) ) {
2006-11-14 19:07:45 -08:00
rsk = sk ;
2005-04-16 15:20:36 -07:00
goto reset ;
2006-11-14 19:07:45 -08:00
}
2005-04-16 15:20:36 -07:00
return 0 ;
reset :
2006-11-14 19:07:45 -08:00
tcp_v4_send_reset ( rsk , skb ) ;
2005-04-16 15:20:36 -07:00
discard :
2022-02-20 15:06:34 +08:00
kfree_skb_reason ( skb , reason ) ;
2005-04-16 15:20:36 -07:00
/* Be careful here. If this function gets more complicated and
* gcc suffers from register pressure on the x86 , sk ( in % ebx )
* might be destroyed here . This current version compiles correctly ,
* but you have been warned .
*/
return 0 ;
csum_err :
2022-02-20 15:06:34 +08:00
reason = SKB_DROP_REASON_TCP_CSUM ;
2021-05-14 13:04:25 -07:00
trace_tcp_bad_csum ( skb ) ;
2016-04-29 14:16:47 -07:00
TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_CSUMERRORS ) ;
TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_INERRS ) ;
2005-04-16 15:20:36 -07:00
goto discard ;
}
2010-07-09 21:22:10 +00:00
EXPORT_SYMBOL ( tcp_v4_do_rcv ) ;
2005-04-16 15:20:36 -07:00
2017-09-28 15:51:36 +02:00
int tcp_v4_early_demux ( struct sk_buff * skb )
2012-06-19 21:22:05 -07:00
{
2022-09-07 18:10:20 -07:00
struct net * net = dev_net ( skb - > dev ) ;
2012-06-19 21:22:05 -07:00
const struct iphdr * iph ;
const struct tcphdr * th ;
struct sock * sk ;
if ( skb - > pkt_type ! = PACKET_HOST )
2017-09-28 15:51:36 +02:00
return 0 ;
2012-06-19 21:22:05 -07:00
2012-10-22 21:42:47 +00:00
if ( ! pskb_may_pull ( skb , skb_transport_offset ( skb ) + sizeof ( struct tcphdr ) ) )
2017-09-28 15:51:36 +02:00
return 0 ;
2012-06-19 21:22:05 -07:00
iph = ip_hdr ( skb ) ;
2012-10-22 21:42:47 +00:00
th = tcp_hdr ( skb ) ;
2012-06-19 21:22:05 -07:00
if ( th - > doff < sizeof ( struct tcphdr ) / 4 )
2017-09-28 15:51:36 +02:00
return 0 ;
2012-06-19 21:22:05 -07:00
2022-09-07 18:10:20 -07:00
sk = __inet_lookup_established ( net , net - > ipv4 . tcp_death_row . hashinfo ,
2012-06-19 21:22:05 -07:00
iph - > saddr , th - > source ,
2012-06-23 17:38:10 +00:00
iph - > daddr , ntohs ( th - > dest ) ,
2017-08-07 08:44:17 -07:00
skb - > skb_iif , inet_sdif ( skb ) ) ;
2012-06-19 21:22:05 -07:00
if ( sk ) {
skb - > sk = sk ;
skb - > destructor = sock_edemux ;
2015-03-15 21:12:13 -07:00
if ( sk_fullsock ( sk ) ) {
2021-12-20 06:33:30 -08:00
struct dst_entry * dst = rcu_dereference ( sk - > sk_rx_dst ) ;
2012-07-27 06:23:40 +00:00
2012-06-19 21:22:05 -07:00
if ( dst )
dst = dst_check ( dst , 0 ) ;
2012-07-23 16:29:00 -07:00
if ( dst & &
2021-10-25 09:48:16 -07:00
sk - > sk_rx_dst_ifindex = = skb - > skb_iif )
2012-07-23 16:29:00 -07:00
skb_dst_set_noref ( skb , dst ) ;
2012-06-19 21:22:05 -07:00
}
}
2017-09-28 15:51:36 +02:00
return 0 ;
2012-06-19 21:22:05 -07:00
}
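The sk_rx_dst handling in tcp_v4_do_rcv() and tcp_v4_early_demux() above follows the usual RCU removal order: the cached pointer is cleared first, and only then is the reference dropped. The following is a minimal userspace sketch of that ordering, not kernel code. It assumes C11 atomics as a stand-in for RCU_INIT_POINTER() and rcu_dereference(), and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct dst_entry and struct sock. */
struct toy_dst {
	int refcnt;	/* references still held */
	int freed;	/* set once the last reference is dropped */
};

struct toy_sock {
	_Atomic(struct toy_dst *) rx_dst;	/* plays the role of sk->sk_rx_dst */
};

/* Drop one reference. The real dst_release() defers the free via call_rcu(). */
static void toy_dst_release(struct toy_dst *dst)
{
	if (dst && --dst->refcnt == 0)
		dst->freed = 1;
}

/* Writer side: unpublish the pointer first, release the object afterwards. */
static void toy_clear_rx_dst(struct toy_sock *sk)
{
	struct toy_dst *dst = atomic_exchange(&sk->rx_dst, (struct toy_dst *)NULL);

	toy_dst_release(dst);
}

/* Reader side: load the pointer once and work on that single snapshot. */
static struct toy_dst *toy_get_rx_dst(struct toy_sock *sk)
{
	return atomic_load(&sk->rx_dst);
}
```

After toy_clear_rx_dst() returns, a reader calling toy_get_rx_dst() sees NULL and cannot pick up the released entry. The sketch shows the ordering only; it does not model grace periods or concurrent readers.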
2022-02-20 15:06:33 +08:00
bool tcp_add_backlog ( struct sock * sk , struct sk_buff * skb ,
enum skb_drop_reason * reason )
2016-08-27 07:37:54 -07:00
{
2021-11-15 11:02:30 -08:00
u32 limit , tail_gso_size , tail_gso_segs ;
2018-11-27 14:42:03 -08:00
struct skb_shared_info * shinfo ;
const struct tcphdr * th ;
struct tcphdr * thtail ;
struct sk_buff * tail ;
unsigned int hdrlen ;
bool fragstolen ;
u32 gso_segs ;
2021-01-19 08:49:00 -08:00
u32 gso_size ;
2018-11-27 14:42:03 -08:00
int delta ;
2016-08-27 07:37:54 -07:00
/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
* we can fix skb - > truesize to its real value to avoid future drops .
* This is valid because skb is not yet charged to the socket .
* It has been noticed pure SACK packets were sometimes dropped
* ( if cooked by drivers without copybreak feature ) .
*/
2017-01-24 14:57:36 -08:00
skb_condense ( skb ) ;
2016-08-27 07:37:54 -07:00
2018-11-19 17:45:55 -08:00
skb_dst_drop ( skb ) ;
2018-11-27 14:42:03 -08:00
if ( unlikely ( tcp_checksum_complete ( skb ) ) ) {
bh_unlock_sock ( sk ) ;
2021-05-14 13:04:25 -07:00
trace_tcp_bad_csum ( skb ) ;
2022-02-20 15:06:33 +08:00
* reason = SKB_DROP_REASON_TCP_CSUM ;
2018-11-27 14:42:03 -08:00
__TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_CSUMERRORS ) ;
__TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_INERRS ) ;
return true ;
}
/* Attempt coalescing to last skb in backlog, even if we are
* above the limits .
* This is okay because skb capacity is limited to MAX_SKB_FRAGS .
*/
th = ( const struct tcphdr * ) skb - > data ;
hdrlen = th - > doff * 4 ;
tail = sk - > sk_backlog . tail ;
if ( ! tail )
goto no_coalesce ;
thtail = ( struct tcphdr * ) tail - > data ;
if ( TCP_SKB_CB ( tail ) - > end_seq ! = TCP_SKB_CB ( skb ) - > seq | |
TCP_SKB_CB ( tail ) - > ip_dsfield ! = TCP_SKB_CB ( skb ) - > ip_dsfield | |
( ( TCP_SKB_CB ( tail ) - > tcp_flags |
2019-04-26 10:10:05 -07:00
TCP_SKB_CB ( skb ) - > tcp_flags ) & ( TCPHDR_SYN | TCPHDR_RST | TCPHDR_URG ) ) | |
! ( ( TCP_SKB_CB ( tail ) - > tcp_flags &
TCP_SKB_CB ( skb ) - > tcp_flags ) & TCPHDR_ACK ) | |
2018-11-27 14:42:03 -08:00
( ( TCP_SKB_CB ( tail ) - > tcp_flags ^
TCP_SKB_CB ( skb ) - > tcp_flags ) & ( TCPHDR_ECE | TCPHDR_CWR ) ) | |
# ifdef CONFIG_TLS_DEVICE
tail - > decrypted ! = skb - > decrypted | |
# endif
thtail - > doff ! = th - > doff | |
memcmp ( thtail + 1 , th + 1 , hdrlen - sizeof ( * th ) ) )
goto no_coalesce ;
__skb_pull ( skb , hdrlen ) ;
2021-01-19 08:49:00 -08:00
shinfo = skb_shinfo ( skb ) ;
gso_size = shinfo - > gso_size ? : skb - > len ;
gso_segs = shinfo - > gso_segs ? : 1 ;
shinfo = skb_shinfo ( tail ) ;
tail_gso_size = shinfo - > gso_size ? : ( tail - > len - hdrlen ) ;
tail_gso_segs = shinfo - > gso_segs ? : 1 ;
2018-11-27 14:42:03 -08:00
if ( skb_try_coalesce ( tail , skb , & fragstolen , & delta ) ) {
TCP_SKB_CB ( tail ) - > end_seq = TCP_SKB_CB ( skb ) - > end_seq ;
tcp: fix receive window update in tcp_add_backlog()
We got reports from GKE customers flows being reset by netfilter
conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
Traces seemed to suggest ACK packet being dropped by the
packet capture, or more likely that ACK were received in the
wrong order.
wscale=7, SYN and SYNACK not shown here.
This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
New right edge of the window -> 51359321+1871*128=51598809
09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
Receiver now allows to send 606*128=77568 from seq 51521241 :
New right edge of the window -> 51521241+606*128=51598809
09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
netfilter conntrack is not happy and sends RST
09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.
New right edge of the window -> 51521241+859*128=51631193
Normally TCP stack handles OOO packets just fine, but it
turns out tcp_add_backlog() does not. It can update the window
field of the aggregated packet even if the ACK sequence
of the last received packet is too old.
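
The window arithmetic in the trace above can be checked with a short userspace sketch. It assumes wscale=7 as stated in the trace; window_right_edge(), seq_before(), struct toy_tail and toy_merge_ack() are hypothetical names used only for illustration, with seq_before() following the same signed-difference idea as the kernel's before().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Right edge of the advertised window: ack + (win << wscale), modulo 2^32. */
static uint32_t window_right_edge(uint32_t ack, uint16_t win, unsigned int wscale)
{
	return ack + ((uint32_t)win << wscale);
}

/* Wraparound-safe "a is older than b" on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical model of the ACK state carried by the coalesced tail skb. */
struct toy_tail {
	uint32_t ack_seq;
	uint16_t window;
};

/* Take ack_seq and window from the new packet only if its ACK is not older. */
static void toy_merge_ack(struct toy_tail *tail, uint32_t ack_seq, uint16_t window)
{
	if (!seq_before(ack_seq, tail->ack_seq)) {
		tail->ack_seq = ack_seq;
		tail->window = window;
	}
}
```

With the values from the trace, a late ACK carrying ack 51488857 and win 859 is older than a tail that already holds ack 51521241 and win 606, so toy_merge_ack() leaves the tail unchanged and the right edge stays at 51598809 instead of moving to 51631193.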
Many thanks to Alexandre Ferrieux for independently reporting the issue
and suggesting a fix.
Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05 06:48:13 -07:00
if ( likely ( ! before ( TCP_SKB_CB ( skb ) - > ack_seq , TCP_SKB_CB ( tail ) - > ack_seq ) ) ) {
2018-11-27 14:42:03 -08:00
TCP_SKB_CB ( tail ) - > ack_seq = TCP_SKB_CB ( skb ) - > ack_seq ;
2020-10-05 06:48:13 -07:00
thtail - > window = th - > window ;
}
2018-11-27 14:42:03 -08:00
2019-04-26 10:10:05 -07:00
/* We have to update both TCP_SKB_CB(tail)->tcp_flags and
* thtail - > fin , so that the fast path in tcp_rcv_established ( )
* is not entered if we append a packet with a FIN .
* SYN , RST , URG are not present .
* ACK is set on both packets .
* PSH : we do not really care in TCP stack ,
* at least for ' GRO ' packets .
*/
thtail - > fin | = th - > fin ;
2018-11-27 14:42:03 -08:00
TCP_SKB_CB ( tail ) - > tcp_flags | = TCP_SKB_CB ( skb ) - > tcp_flags ;
if ( TCP_SKB_CB ( skb ) - > has_rxtstamp ) {
TCP_SKB_CB ( tail ) - > has_rxtstamp = true ;
tail - > tstamp = skb - > tstamp ;
skb_hwtstamps ( tail ) - > hwtstamp = skb_hwtstamps ( skb ) - > hwtstamp ;
}
/* Not as strict as GRO. We only need to carry mss max value */
2021-01-19 08:49:00 -08:00
shinfo - > gso_size = max ( gso_size , tail_gso_size ) ;
shinfo - > gso_segs = min_t ( u32 , gso_segs + tail_gso_segs , 0xFFFF ) ;
2018-11-27 14:42:03 -08:00
sk - > sk_backlog . len + = delta ;
__NET_INC_STATS ( sock_net ( sk ) ,
LINUX_MIB_TCPBACKLOGCOALESCE ) ;
kfree_skb_partial ( skb , fragstolen ) ;
return false ;
}
__skb_push ( skb , hdrlen ) ;
no_coalesce :
2022-10-21 12:06:22 +08:00
limit = ( u32 ) READ_ONCE ( sk - > sk_rcvbuf ) + ( u32 ) ( READ_ONCE ( sk - > sk_sndbuf ) > > 1 ) ;
2018-11-27 14:42:03 -08:00
/* Only socket owner can try to collapse/prune rx queues
* to reduce memory overhead , so add a little headroom here .
* Few sockets backlog are possibly concurrently non empty .
*/
2022-10-21 12:06:22 +08:00
limit + = 64 * 1024 ;
2018-11-27 14:42:03 -08:00
2016-08-27 07:37:54 -07:00
if ( unlikely ( sk_add_backlog ( sk , skb , limit ) ) ) {
bh_unlock_sock ( sk ) ;
2022-02-20 15:06:33 +08:00
* reason = SKB_DROP_REASON_SOCKET_BACKLOG ;
2016-08-27 07:37:54 -07:00
__NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPBACKLOGDROP ) ;
return true ;
}
return false ;
}
EXPORT_SYMBOL ( tcp_add_backlog ) ;
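The backlog limit computed in the no_coalesce path above is the receive buffer size, plus half the send buffer size, plus 64 KiB of headroom. A minimal userspace sketch of that arithmetic follows; toy_backlog_limit() is a hypothetical helper, and the buffer sizes are passed as plain int on the assumption that they mirror the signed sk_rcvbuf and sk_sndbuf fields. The casts to an unsigned 32-bit type presumably keep the sum from overflowing a signed int.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* limit = rcvbuf + sndbuf / 2 + 64 KiB, computed in unsigned 32-bit arithmetic. */
static uint32_t toy_backlog_limit(int rcvbuf, int sndbuf)
{
	uint32_t limit = (uint32_t)rcvbuf + (uint32_t)(sndbuf >> 1);

	/* Headroom: only the socket owner can collapse or prune the rx queues. */
	limit += 64 * 1024;
	return limit;
}
```

For example, a 131072-byte receive buffer and a 16384-byte send buffer give a limit of 131072 + 8192 + 65536 = 204800 bytes. Even with both buffers at INT_MAX the result still fits in 32 unsigned bits.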
2016-11-10 13:12:35 -08:00
int tcp_filter ( struct sock * sk , struct sk_buff * skb )
{
struct tcphdr * th = ( struct tcphdr * ) skb - > data ;
2019-03-11 11:41:05 -07:00
return sk_filter_trim_cap ( sk , skb , th - > doff * 4 ) ;
2016-11-10 13:12:35 -08:00
}
EXPORT_SYMBOL ( tcp_filter ) ;
2017-12-03 09:32:59 -08:00
static void tcp_v4_restore_cb ( struct sk_buff * skb )
{
memmove ( IPCB ( skb ) , & TCP_SKB_CB ( skb ) - > header . h4 ,
sizeof ( struct inet_skb_parm ) ) ;
}
static void tcp_v4_fill_cb ( struct sk_buff * skb , const struct iphdr * iph ,
const struct tcphdr * th )
{
/* This is tricky : We move IPCB at its correct location into TCP_SKB_CB()
* barrier ( ) makes sure compiler wont play fool ^ Waliasing games .
*/
memmove ( & TCP_SKB_CB ( skb ) - > header . h4 , IPCB ( skb ) ,
sizeof ( struct inet_skb_parm ) ) ;
barrier ( ) ;
TCP_SKB_CB ( skb ) - > seq = ntohl ( th - > seq ) ;
TCP_SKB_CB ( skb ) - > end_seq = ( TCP_SKB_CB ( skb ) - > seq + th - > syn + th - > fin +
skb - > len - th - > doff * 4 ) ;
TCP_SKB_CB ( skb ) - > ack_seq = ntohl ( th - > ack_seq ) ;
TCP_SKB_CB ( skb ) - > tcp_flags = tcp_flag_byte ( th ) ;
TCP_SKB_CB ( skb ) - > tcp_tw_isn = 0 ;
TCP_SKB_CB ( skb ) - > ip_dsfield = ipv4_get_dsfield ( iph ) ;
TCP_SKB_CB ( skb ) - > sacked = 0 ;
TCP_SKB_CB ( skb ) - > has_rxtstamp =
skb - > tstamp | | skb_hwtstamps ( skb ) - > hwtstamp ;
}
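The end_seq computation in tcp_v4_fill_cb() above counts the payload bytes plus one sequence number each for SYN and FIN. A minimal userspace sketch follows; toy_end_seq() is a hypothetical helper whose parameters mirror th->seq (already in host byte order), th->syn, th->fin, skb->len and th->doff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * end_seq = seq + syn + fin + payload length, modulo 2^32.
 * skb_len still includes the TCP header, which is doff 32-bit words long.
 */
static uint32_t toy_end_seq(uint32_t seq, unsigned int syn, unsigned int fin,
			    uint32_t skb_len, unsigned int doff)
{
	return seq + syn + fin + skb_len - doff * 4;
}
```

A 1428-byte segment with a 20-byte header (doff = 5) starting at sequence 1000 ends at 2408. A bare SYN consumes exactly one sequence number, and the result wraps naturally when seq is close to 2^32.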
2005-04-16 15:20:36 -07:00
/*
* From tcp_input . c
*/
int tcp_v4_rcv ( struct sk_buff * skb )
{
2016-04-01 08:52:17 -07:00
struct net * net = dev_net ( skb - > dev ) ;
2022-02-20 15:06:32 +08:00
enum skb_drop_reason drop_reason ;
2017-08-07 08:44:17 -07:00
int sdif = inet_sdif ( skb ) ;
2019-12-30 14:14:27 -08:00
int dif = inet_iif ( skb ) ;
2007-04-20 22:47:35 -07:00
const struct iphdr * iph ;
2011-10-21 05:22:42 -04:00
const struct tcphdr * th ;
2016-04-01 08:52:17 -07:00
bool refcounted ;
2005-04-16 15:20:36 -07:00
struct sock * sk ;
int ret ;
2022-01-09 14:36:27 +08:00
drop_reason = SKB_DROP_REASON_NOT_SPECIFIED ;
2005-04-16 15:20:36 -07:00
if ( skb - > pkt_type ! = PACKET_HOST )
goto discard_it ;
/* Count it even if it's bad */
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_INSEGS ) ;
2005-04-16 15:20:36 -07:00
if ( ! pskb_may_pull ( skb , sizeof ( struct tcphdr ) ) )
goto discard_it ;
2016-05-13 09:16:40 -07:00
th = ( const struct tcphdr * ) skb - > data ;
2005-04-16 15:20:36 -07:00
2022-01-09 14:36:27 +08:00
if ( unlikely ( th - > doff < sizeof ( struct tcphdr ) / 4 ) ) {
drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL ;
2005-04-16 15:20:36 -07:00
goto bad_packet ;
2022-01-09 14:36:27 +08:00
}
2005-04-16 15:20:36 -07:00
if ( ! pskb_may_pull ( skb , th - > doff * 4 ) )
goto discard_it ;
/* An explanation is required here, I think.
* Packet length and doff are validated by header prediction ,
2005-11-10 17:13:47 -08:00
* provided case of th - > doff = = 0 is eliminated .
2005-04-16 15:20:36 -07:00
* So , we defer the checks . */
2014-05-02 16:29:38 -07:00
if ( skb_checksum_init ( skb , IPPROTO_TCP , inet_compute_pseudo ) )
2013-04-29 08:39:56 +00:00
goto csum_error ;
2005-04-16 15:20:36 -07:00
2016-05-13 09:16:40 -07:00
th = ( const struct tcphdr * ) skb - > data ;
2007-04-20 22:47:35 -07:00
iph = ip_hdr ( skb ) ;
2015-10-13 17:12:54 -07:00
lookup :
2022-09-07 18:10:20 -07:00
sk = __inet_lookup_skb ( net - > ipv4 . tcp_death_row . hashinfo ,
skb , __tcp_hdrlen ( th ) , th - > source ,
2017-08-07 08:44:17 -07:00
th - > dest , sdif , & refcounted ) ;
2005-04-16 15:20:36 -07:00
if ( ! sk )
goto no_tcp_socket ;
2010-03-09 05:55:56 +00:00
process :
if ( sk - > sk_state = = TCP_TIME_WAIT )
goto do_time_wait ;
2015-10-02 11:43:32 -07:00
if ( sk - > sk_state = = TCP_NEW_SYN_RECV ) {
struct request_sock * req = inet_reqsk ( sk ) ;
2018-02-13 06:14:12 -08:00
bool req_stolen = false ;
2016-02-18 05:39:18 -08:00
struct sock * nsk ;
2015-10-02 11:43:32 -07:00
sk = req - > rsk_listener ;
2022-06-23 05:04:36 +00:00
if ( ! xfrm4_policy_check ( sk , XFRM_POLICY_IN , skb ) )
drop_reason = SKB_DROP_REASON_XFRM_POLICY ;
else
drop_reason = tcp_inbound_md5_hash ( sk , skb ,
2022-03-07 16:44:21 -08:00
& iph - > saddr , & iph - > daddr ,
AF_INET , dif , sdif ) ;
if ( unlikely ( drop_reason ) ) {
2016-08-24 08:50:24 -07:00
sk_drops_add ( sk , skb ) ;
2016-02-11 22:50:29 -08:00
reqsk_put ( req ) ;
goto discard_it ;
}
2018-06-12 23:09:37 +00:00
if ( tcp_checksum_complete ( skb ) ) {
reqsk_put ( req ) ;
goto csum_error ;
}
2016-02-18 05:39:18 -08:00
if ( unlikely ( sk - > sk_state ! = TCP_LISTEN ) ) {
tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
This patch also changes the code to call reuseport_migrate_sock() and
inet_reqsk_clone(), but unlike the other cases, we do not call
inet_reqsk_clone() right after reuseport_migrate_sock().
Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
has three kinds of refcnt:
(A) for listener itself
(B) carried by request_sock
(C) sock_hold() in tcp_v[46]_rcv()
While processing the req, (A) may disappear by close(listener). Also, (B)
can disappear by accept(listener) once we put the req into the accept
queue. So, we have to hold another refcnt (C) for the listener to prevent
use-after-free.
For socket migration, we call reuseport_migrate_sock() to select a listener
with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
Thus we have to take another refcnt (B) for the newly cloned request_sock.
In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
try to put the new req into the accept queue. By migrating req after
winning the "own_req" race, we can avoid such a worst situation:
CPU 1 looks up req1
CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
...
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
2021-06-12 21:32:20 +09:00
nsk = reuseport_migrate_sock ( sk , req_to_sk ( req ) , skb ) ;
if ( ! nsk ) {
inet_csk_reqsk_queue_drop_and_put ( sk , req ) ;
goto lookup ;
}
sk = nsk ;
/* reuseport_migrate_sock() has already held one sk_refcnt
* before returning .
*/
} else {
/* We own a reference on the listener, increase it again
* as we might lose it too soon .
*/
sock_hold ( sk ) ;
2015-10-13 17:12:54 -07:00
}
2016-04-01 08:52:17 -07:00
refcounted = true ;
2017-09-08 12:44:47 -07:00
nsk = NULL ;
2017-12-03 09:32:59 -08:00
if ( ! tcp_filter ( sk , skb ) ) {
th = ( const struct tcphdr * ) skb - > data ;
iph = ip_hdr ( skb ) ;
tcp_v4_fill_cb ( skb , iph , th ) ;
2018-02-13 06:14:12 -08:00
nsk = tcp_check_req ( sk , skb , req , false , & req_stolen ) ;
2022-02-20 15:06:30 +08:00
} else {
drop_reason = SKB_DROP_REASON_SOCKET_FILTER ;
2017-12-03 09:32:59 -08:00
}
2015-10-02 11:43:32 -07:00
if ( ! nsk ) {
reqsk_put ( req ) ;
2018-02-13 06:14:12 -08:00
if ( req_stolen ) {
/* Another cpu got exclusive access to req
* and created a full blown socket .
* Try to feed this packet to this socket
* instead of discarding it .
*/
tcp_v4_restore_cb ( skb ) ;
sock_put ( sk ) ;
goto lookup ;
}
2016-02-18 05:39:18 -08:00
goto discard_and_relse ;
2015-10-02 11:43:32 -07:00
}
2022-06-23 05:04:36 +00:00
nf_reset_ct ( skb ) ;
2015-10-02 11:43:32 -07:00
if ( nsk = = sk ) {
reqsk_put ( req ) ;
2017-12-03 09:32:59 -08:00
tcp_v4_restore_cb ( skb ) ;
2015-10-02 11:43:32 -07:00
} else if ( tcp_child_process ( sk , nsk , skb ) ) {
tcp_v4_send_reset ( nsk , skb ) ;
2016-02-18 05:39:18 -08:00
goto discard_and_relse ;
2015-10-02 11:43:32 -07:00
} else {
2016-02-18 05:39:18 -08:00
sock_put ( sk ) ;
2015-10-02 11:43:32 -07:00
return 0 ;
}
}
2021-10-25 09:48:23 -07:00
2021-10-25 09:48:24 -07:00
if ( static_branch_unlikely ( & ip4_min_ttl ) ) {
/* min_ttl can be changed concurrently from do_ip_setsockopt() */
if ( unlikely ( iph - > ttl < READ_ONCE ( inet_sk ( sk ) - > min_ttl ) ) ) {
__NET_INC_STATS ( net , LINUX_MIB_TCPMINTTLDROP ) ;
2023-02-01 17:43:45 +00:00
drop_reason = SKB_DROP_REASON_TCP_MINTTL ;
2021-10-25 09:48:24 -07:00
goto discard_and_relse ;
}
2010-03-07 23:21:57 +00:00
}
2010-01-11 16:28:01 -08:00
2022-02-20 15:06:30 +08:00
if ( ! xfrm4_policy_check ( sk , XFRM_POLICY_IN , skb ) ) {
drop_reason = SKB_DROP_REASON_XFRM_POLICY ;
2005-04-16 15:20:36 -07:00
goto discard_and_relse ;
2022-02-20 15:06:30 +08:00
}
2014-08-07 02:38:22 +04:00
2022-03-07 16:44:21 -08:00
drop_reason = tcp_inbound_md5_hash ( sk , skb , & iph - > saddr ,
& iph - > daddr , AF_INET , dif , sdif ) ;
if ( drop_reason )
2014-08-07 02:38:22 +04:00
goto discard_and_relse ;
2019-09-29 20:54:03 +02:00
nf_reset_ct ( skb ) ;
2005-04-16 15:20:36 -07:00
2022-01-09 14:36:27 +08:00
if ( tcp_filter ( sk , skb ) ) {
2022-01-27 17:13:01 +08:00
drop_reason = SKB_DROP_REASON_SOCKET_FILTER ;
2005-04-16 15:20:36 -07:00
goto discard_and_relse ;
2022-01-09 14:36:27 +08:00
}
2016-11-10 13:12:35 -08:00
th = ( const struct tcphdr * ) skb - > data ;
iph = ip_hdr ( skb ) ;
2017-12-03 09:32:59 -08:00
tcp_v4_fill_cb ( skb , iph , th ) ;
2005-04-16 15:20:36 -07:00
skb - > dev = NULL ;
2015-10-02 11:43:39 -07:00
if ( sk - > sk_state = = TCP_LISTEN ) {
ret = tcp_v4_do_rcv ( sk , skb ) ;
goto put_and_return ;
}
sk_incoming_cpu_update ( sk ) ;
2006-07-03 00:25:13 -07:00
bh_lock_sock_nested ( sk ) ;
2016-03-14 10:52:15 -07:00
tcp_segs_in ( tcp_sk ( sk ) , skb ) ;
2005-04-16 15:20:36 -07:00
ret = 0 ;
if ( ! sock_owned_by_user ( sk ) ) {
2017-07-30 03:57:18 +02:00
ret = tcp_v4_do_rcv ( sk , skb ) ;
tcp: add one skb cache for rx
Oftentimes, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.
This means the incoming skb had to be allocated on a cpu,
but freed on another.
This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.
Moreover, performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increases the probability
of trapping them in backlog processing, since the BH handler
might find the socket owned by the user.
This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.
(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
instead of 8 Mpps)
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :
- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.
Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-22 08:56:40 -07:00
} else {
2022-02-20 15:06:33 +08:00
if ( tcp_add_backlog ( sk , skb , & drop_reason ) )
2019-03-22 08:56:40 -07:00
goto discard_and_relse ;
2010-03-04 18:01:41 +00:00
}
2005-04-16 15:20:36 -07:00
bh_unlock_sock ( sk ) ;
2015-10-02 11:43:39 -07:00
put_and_return :
2016-04-01 08:52:17 -07:00
if ( refcounted )
sock_put ( sk ) ;
2005-04-16 15:20:36 -07:00
return ret ;
no_tcp_socket :
2022-01-09 14:36:27 +08:00
drop_reason = SKB_DROP_REASON_NO_SOCKET ;
2005-04-16 15:20:36 -07:00
if ( ! xfrm4_policy_check ( NULL , XFRM_POLICY_IN , skb ) )
goto discard_it ;
2017-12-03 09:32:59 -08:00
tcp_v4_fill_cb ( skb , iph , th ) ;
2015-06-03 23:49:21 -07:00
if ( tcp_checksum_complete ( skb ) ) {
2013-04-29 08:39:56 +00:00
csum_error :
2022-01-09 14:36:27 +08:00
drop_reason = SKB_DROP_REASON_TCP_CSUM ;
2021-05-14 13:04:25 -07:00
trace_tcp_bad_csum ( skb ) ;
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_CSUMERRORS ) ;
2005-04-16 15:20:36 -07:00
bad_packet :
2016-04-27 16:44:32 -07:00
__TCP_INC_STATS ( net , TCP_MIB_INERRS ) ;
2005-04-16 15:20:36 -07:00
} else {
2006-11-14 19:07:45 -08:00
tcp_v4_send_reset ( NULL , skb ) ;
2005-04-16 15:20:36 -07:00
}
discard_it :
2022-05-13 11:03:39 +08:00
SKB_DR_OR ( drop_reason , NOT_SPECIFIED ) ;
2005-04-16 15:20:36 -07:00
/* Discard frame. */
2022-01-09 14:36:27 +08:00
kfree_skb_reason ( skb , drop_reason ) ;
2007-02-09 23:24:47 +09:00
return 0 ;
2005-04-16 15:20:36 -07:00
discard_and_relse :
2016-04-01 08:52:19 -07:00
sk_drops_add ( sk , skb ) ;
2016-04-01 08:52:17 -07:00
if ( refcounted )
sock_put ( sk ) ;
2005-04-16 15:20:36 -07:00
goto discard_it ;
do_time_wait :
if ( ! xfrm4_policy_check ( NULL , XFRM_POLICY_IN , skb ) ) {
2022-02-20 15:06:30 +08:00
drop_reason = SKB_DROP_REASON_XFRM_POLICY ;
2006-10-10 19:41:46 -07:00
inet_twsk_put ( inet_twsk ( sk ) ) ;
2005-04-16 15:20:36 -07:00
goto discard_it ;
}
2017-12-03 09:32:59 -08:00
tcp_v4_fill_cb ( skb , iph , th ) ;
2013-04-29 08:39:56 +00:00
if ( tcp_checksum_complete ( skb ) ) {
inet_twsk_put ( inet_twsk ( sk ) ) ;
goto csum_error ;
2005-04-16 15:20:36 -07:00
}
2006-10-10 19:41:46 -07:00
switch ( tcp_timewait_state_process ( inet_twsk ( sk ) , skb , th ) ) {
2005-04-16 15:20:36 -07:00
case TCP_TW_SYN : {
2022-09-07 18:10:20 -07:00
struct sock * sk2 = inet_lookup_listener ( net ,
net - > ipv4 . tcp_death_row . hashinfo ,
skb , __tcp_hdrlen ( th ) ,
2013-01-22 09:50:24 +00:00
iph - > saddr , th - > source ,
2007-04-20 22:47:35 -07:00
iph - > daddr , th - > dest ,
2017-08-07 08:44:17 -07:00
inet_iif ( skb ) ,
sdif ) ;
2005-04-16 15:20:36 -07:00
if ( sk2 ) {
2015-07-08 14:28:30 -07:00
inet_twsk_deschedule_put ( inet_twsk ( sk ) ) ;
2005-04-16 15:20:36 -07:00
sk = sk2 ;
2017-12-03 09:32:59 -08:00
tcp_v4_restore_cb ( skb ) ;
2016-04-01 08:52:17 -07:00
refcounted = false ;
2005-04-16 15:20:36 -07:00
goto process ;
}
}
2017-10-16 15:48:55 -05:00
/* to ACK */
2020-03-12 15:50:22 -07:00
fallthrough ;
2005-04-16 15:20:36 -07:00
case TCP_TW_ACK :
tcp_v4_timewait_ack ( sk , skb ) ;
break ;
case TCP_TW_RST :
tcp: honour SO_BINDTODEVICE for TW_RST case too
Hannes points out that when we generate tcp reset for timewait sockets we
pretend we found no socket and pass NULL sk to tcp_vX_send_reset().
Make it cope with inet tw sockets and then provide tw sk.
This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.
Packetdrill test case:
// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0
0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1
0.210 close(3) = 0
0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46
// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46
// fails without this change -- default route is used
1.301 > R 1:1(0) win 0
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-21 21:29:26 +01:00
tcp_v4_send_reset ( sk , skb ) ;
inet_twsk_deschedule_put ( inet_twsk ( sk ) ) ;
goto discard_it ;
2005-04-16 15:20:36 -07:00
case TCP_TW_SUCCESS : ;
}
goto discard_it ;
}
2010-12-01 18:09:13 -08:00
static struct timewait_sock_ops tcp_timewait_sock_ops = {
. twsk_obj_size = sizeof ( struct tcp_timewait_sock ) ,
. twsk_unique = tcp_twsk_unique ,
. twsk_destructor = tcp_twsk_destructor ,
} ;
2005-04-16 15:20:36 -07:00
2012-08-09 14:11:00 +00:00
void inet_sk_rx_dst_set ( struct sock * sk , const struct sk_buff * skb )
2012-08-06 05:09:33 +00:00
{
struct dst_entry * dst = skb_dst ( skb ) ;
net: fix IP early demux races
David Wilder reported crashes caused by dst reuse.
<quote David>
I am seeing a crash on a distro V4.2.3 kernel caused by a double
release of a dst_entry. In ipv4_dst_destroy() the call to
list_empty() finds a poisoned next pointer, indicating the dst_entry
has already been removed from the list and freed. The crash occurs
18 to 24 hours into a run of a network stress exerciser.
</quote>
Thanks to his detailed report and analysis, we were able to understand
the core issue.
IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.
When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.
Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.
We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.
This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.
It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.
Can probably be backported back to linux-3.6 kernels
Reported-by: David J. Wilder <dwilder@us.ibm.com>
Tested-by: David J. Wilder <dwilder@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 14:08:53 -08:00
if ( dst & & dst_hold_safe ( dst ) ) {
inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
are not following standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation of RCU protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 06:33:30 -08:00
rcu_assign_pointer ( sk - > sk_rx_dst , dst ) ;
2021-10-25 09:48:16 -07:00
sk - > sk_rx_dst_ifindex = skb - > skb_iif ;
2014-09-08 08:06:07 -07:00
}
2012-08-06 05:09:33 +00:00
}
2012-08-09 14:11:00 +00:00
EXPORT_SYMBOL ( inet_sk_rx_dst_set ) ;
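The "net: fix IP early demux races" message above explains why a plain atomic increment is unsafe once the count may already have dropped to zero. The following is a minimal userspace sketch of the increment-unless-zero pattern using C11 atomics. It is not the kernel's dst_hold_safe() implementation; struct obj and obj_hold_safe() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object. */
struct obj {
	atomic_int refcnt;
};

/* Take a reference only if the object is still live (refcnt != 0).
 * Returns false when the count has already reached zero, meaning the
 * object may be queued for destruction and must not be reused.
 */
static bool obj_hold_safe(struct obj *o)
{
	assert(o != NULL);
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

A caller would store the pointer for later reuse only when obj_hold_safe() returns true, which mirrors the `if ( dst & & dst_hold_safe ( dst ) )` guard in inet_sk_rx_dst_set().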
2012-08-06 05:09:33 +00:00
2009-09-01 19:25:04 +00:00
const struct inet_connection_sock_af_ops ipv4_specific = {
2006-03-20 22:48:35 -08:00
. queue_xmit = ip_queue_xmit ,
. send_check = tcp_v4_send_check ,
. rebuild_header = inet_sk_rebuild_header ,
2012-08-06 05:09:33 +00:00
. sk_rx_dst_set = inet_sk_rx_dst_set ,
2006-03-20 22:48:35 -08:00
. conn_request = tcp_v4_conn_request ,
. syn_recv_sock = tcp_v4_syn_recv_sock ,
. net_header_len = sizeof ( struct iphdr ) ,
. setsockopt = ip_setsockopt ,
. getsockopt = ip_getsockopt ,
. addr2sockaddr = inet_csk_addr2sockaddr ,
. sockaddr_len = sizeof ( struct sockaddr_in ) ,
2014-08-14 12:40:05 -04:00
. mtu_reduced = tcp_v4_mtu_reduced ,
2005-04-16 15:20:36 -07:00
} ;
2010-07-09 21:22:10 +00:00
EXPORT_SYMBOL ( ipv4_specific ) ;
2005-04-16 15:20:36 -07:00
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2009-09-01 19:25:03 +00:00
static const struct tcp_sock_af_ops tcp_sock_ipv4_specific = {
2006-11-14 19:07:45 -08:00
. md5_lookup = tcp_v4_md5_lookup ,
2008-07-19 00:01:42 -07:00
. calc_md5_hash = tcp_v4_md5_hash_skb ,
2006-11-14 19:07:45 -08:00
. md5_parse = tcp_v4_parse_md5_keys ,
} ;
2006-11-30 19:16:28 -08:00
# endif
2006-11-14 19:07:45 -08:00
2005-04-16 15:20:36 -07:00
/* NOTE: A lot of things set to zero explicitly by call to
* sk_alloc ( ) so need not be done here .
*/
static int tcp_v4_init_sock ( struct sock * sk )
{
2005-08-10 04:03:31 -03:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-16 15:20:36 -07:00
2012-04-19 09:55:21 +00:00
tcp_init_sock ( sk ) ;
2005-04-16 15:20:36 -07:00
2005-12-13 23:15:52 -08:00
icsk - > icsk_af_ops = & ipv4_specific ;
2012-04-19 09:55:21 +00:00
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
2012-04-23 03:21:58 -04:00
tcp_sk ( sk ) - > af_specific = & tcp_sock_ipv4_specific ;
2006-11-14 19:07:45 -08:00
# endif
2005-04-16 15:20:36 -07:00
return 0 ;
}
2008-06-14 17:04:49 -07:00
void tcp_v4_destroy_sock ( struct sock * sk )
2005-04-16 15:20:36 -07:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
2017-10-23 09:20:26 -07:00
trace_tcp_destroy_sock ( sk ) ;
2005-04-16 15:20:36 -07:00
tcp_clear_xmit_timers ( sk ) ;
2005-08-10 04:03:31 -03:00
tcp_cleanup_congestion_control ( sk ) ;
2005-06-23 12:19:55 -07:00
2017-06-14 11:37:14 -07:00
tcp_cleanup_ulp ( sk ) ;
2005-04-16 15:20:36 -07:00
/* Clean up the write buffer. */
2007-03-07 12:12:44 -08:00
tcp_write_queue_purge ( sk ) ;
2005-04-16 15:20:36 -07:00
net/tcp_fastopen: Disable active side TFO in certain scenarios
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO socket not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rationale behind it is that when such a firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 14:45:46 -07:00
/* Check if we want to disable active TFO */
tcp_fastopen_active_disable_ofo_check ( sk ) ;
2005-04-16 15:20:36 -07:00
/* Cleans up our, hopefully empty, out_of_order_queue. */
tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to an RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-07 14:49:28 -07:00
skb_rbtree_purge ( & tp - > out_of_order_queue ) ;
2005-04-16 15:20:36 -07:00
2006-11-14 19:07:45 -08:00
# ifdef CONFIG_TCP_MD5SIG
/* Clean up the MD5 key list, if any */
if ( tp - > md5sig_info ) {
2012-01-31 05:18:33 +00:00
tcp_clear_md5_list ( sk ) ;
2017-12-21 10:29:10 -08:00
kfree_rcu ( rcu_dereference_protected ( tp - > md5sig_info , 1 ) , rcu ) ;
2006-11-14 19:07:45 -08:00
tp - > md5sig_info = NULL ;
2022-11-23 17:38:57 +00:00
static_branch_slow_dec_deferred ( & tcp_md5_needed ) ;
2006-11-14 19:07:45 -08:00
}
# endif
2006-05-23 18:05:53 -07:00
2005-04-16 15:20:36 -07:00
/* Clean up a referenced TCP bind bucket. */
2005-08-09 20:10:42 -07:00
if ( inet_csk ( sk ) - > icsk_bind_hash )
[SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 needs
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 04:06:04 -08:00
inet_put_port ( sk ) ;
2005-04-16 15:20:36 -07:00
2019-10-10 20:17:38 -07:00
BUG_ON ( rcu_access_pointer ( tp - > fastopen_rsk ) ) ;
TCPCT part 1d: define TCP cookie option, extend existing struct's
Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration). There are no additions to
tcp_request_sock, and only 1 pointer in tcp_sock.
This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
The principal difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data. This is more flexible and
less subject to user configuration error. Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.
"Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
"Re: what a new TCP header might look like", May 12, 1998.
ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
These functions will also be used in subsequent patches that implement
additional features.
Requires:
TCPCT part 1a: add request_values parameter for sending SYNACK
TCPCT part 1b: generate Responder Cookie secret
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 18:17:05 +00:00
2012-07-19 06:43:09 +00:00
/* If socket is aborted during connect operation */
tcp_free_fastopen_req ( tp ) ;
2017-10-18 11:22:51 -07:00
tcp_fastopen_destroy_cipher ( sk ) ;
2015-05-03 21:34:46 -07:00
tcp_saved_syn_free ( tp ) ;
2012-07-19 06:43:09 +00:00
2011-12-11 21:47:02 +00:00
sk_sockets_allocated_dec ( sk ) ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( tcp_v4_destroy_sock ) ;
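The commit message quoted inside tcp_v4_destroy_sock() describes a global back-off for active TFO: disable for 1h on the first blackhole detection, then 2h, 4h, 8h, and reset once a non-loopback TFO client receives data. The following is a minimal userspace sketch of that policy only, not the kernel implementation. All struct and function names are hypothetical, and the cap on the doubling (shift limited to 6) is an assumption added to keep the arithmetic bounded.

```c
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel symbols. */
struct tfo_blackhole_state {
	uint32_t base_timeout_sec; /* models the debug sysctl; 0 turns the logic off */
	uint32_t disable_count;    /* consecutive blackhole detections so far */
	uint64_t disabled_at_sec;  /* time of the most recent detection */
};

/* A client-side TFO socket saw an out-of-order FIN or RST. */
static void tfo_note_blackhole(struct tfo_blackhole_state *s, uint64_t now_sec)
{
	if (s->base_timeout_sec == 0)
		return; /* active-disable logic is switched off */
	s->disable_count++;
	s->disabled_at_sec = now_sec;
}

/* A non-loopback client-side TFO socket received data: reset the back-off. */
static void tfo_note_success(struct tfo_blackhole_state *s)
{
	s->disable_count = 0;
}

/* Length of the current disable window: base, 2*base, 4*base, ... */
static uint64_t tfo_disable_window(const struct tfo_blackhole_state *s)
{
	uint32_t shift;

	if (s->disable_count == 0 || s->base_timeout_sec == 0)
		return 0;
	shift = s->disable_count - 1;
	if (shift > 6) /* assumed cap, see lead-in */
		shift = 6;
	return (uint64_t)s->base_timeout_sec << shift;
}

/* Returns 1 when active TFO may be used at time now_sec, 0 otherwise. */
static int tfo_active_allowed(const struct tfo_blackhole_state *s,
			      uint64_t now_sec)
{
	uint64_t window = tfo_disable_window(s);

	return window == 0 || now_sec - s->disabled_at_sec >= window;
}
```

Note that the detection count survives the end of a window, so a second detection after the first hour has expired still doubles the timeout, matching the "if it happens again" wording of the commit message.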
# ifdef CONFIG_PROC_FS
/* Proc filesystem TCP sock list dumping. */
2021-07-01 13:05:48 -07:00
static unsigned short seq_file_family ( const struct seq_file * seq ) ;
static bool seq_sk_match ( struct seq_file * seq , const struct sock * sk )
{
unsigned short family = seq_file_family ( seq ) ;
/* AF_UNSPEC is used as a match all */
return ( ( family = = AF_UNSPEC | | family = = sk - > sk_family ) & &
net_eq ( sock_net ( sk ) , seq_file_net ( seq ) ) ) ;
}
2021-07-01 13:06:00 -07:00
/* Find a non empty bucket (starting from st->bucket)
* and return the first sk from it .
2010-06-07 00:43:42 -07:00
*/
2021-07-01 13:06:00 -07:00
static void * listening_get_first ( struct seq_file * seq )
2005-04-16 15:20:36 -07:00
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st = seq - > private ;
2005-04-16 15:20:36 -07:00
2021-07-01 13:06:00 -07:00
st - > offset = 0 ;
2022-09-07 18:10:20 -07:00
for ( ; st - > bucket < = hinfo - > lhash2_mask ; st - > bucket + + ) {
2021-07-01 13:06:06 -07:00
struct inet_listen_hashbucket * ilb2 ;
2022-05-11 17:06:05 -07:00
struct hlist_nulls_node * node ;
2021-07-01 13:06:00 -07:00
struct sock * sk ;
2020-06-23 16:08:04 -07:00
2022-09-07 18:10:20 -07:00
ilb2 = & hinfo - > lhash2 [ st - > bucket ] ;
2022-05-11 17:06:05 -07:00
if ( hlist_nulls_empty ( & ilb2 - > nulls_head ) )
2021-07-01 13:06:00 -07:00
continue ;
2021-07-01 13:06:06 -07:00
spin_lock ( & ilb2 - > lock ) ;
2022-05-11 17:06:05 -07:00
sk_nulls_for_each ( sk , node , & ilb2 - > nulls_head ) {
2021-07-01 13:06:00 -07:00
if ( seq_sk_match ( seq , sk ) )
return sk ;
}
2021-07-01 13:06:06 -07:00
spin_unlock ( & ilb2 - > lock ) ;
2005-04-16 15:20:36 -07:00
}
2021-07-01 13:06:00 -07:00
return NULL ;
}
/* Find the next sk of "cur" within the same bucket (i.e. st->bucket).
* If " cur " is the last one in the st - > bucket ,
* call listening_get_first ( ) to return the first sk of the next
* non empty bucket .
2010-06-07 00:43:42 -07:00
*/
2005-04-16 15:20:36 -07:00
static void * listening_get_next ( struct seq_file * seq , void * cur )
{
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st = seq - > private ;
2021-07-01 13:06:06 -07:00
struct inet_listen_hashbucket * ilb2 ;
2022-05-11 17:06:05 -07:00
struct hlist_nulls_node * node ;
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo ;
2016-04-01 08:52:17 -07:00
struct sock * sk = cur ;
2005-04-16 15:20:36 -07:00
+ + st - > num ;
2010-06-07 00:43:42 -07:00
+ + st - > offset ;
2005-04-16 15:20:36 -07:00
2022-05-11 17:06:05 -07:00
sk = sk_nulls_next ( sk ) ;
sk_nulls_for_each_from ( sk , node ) {
2021-07-01 13:05:48 -07:00
if ( seq_sk_match ( seq , sk ) )
2016-04-01 08:52:17 -07:00
return sk ;
2005-04-16 15:20:36 -07:00
}
2021-07-01 13:06:00 -07:00
2022-09-07 18:10:20 -07:00
hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
ilb2 = & hinfo - > lhash2 [ st - > bucket ] ;
2021-07-01 13:06:06 -07:00
spin_unlock ( & ilb2 - > lock ) ;
2021-07-01 13:06:00 -07:00
+ + st - > bucket ;
return listening_get_first ( seq ) ;
2005-04-16 15:20:36 -07:00
}
static void * listening_get_idx ( struct seq_file * seq , loff_t * pos )
{
2010-06-07 00:43:42 -07:00
struct tcp_iter_state * st = seq - > private ;
void * rc ;
st - > bucket = 0 ;
st - > offset = 0 ;
2021-07-01 13:06:00 -07:00
rc = listening_get_first ( seq ) ;
2005-04-16 15:20:36 -07:00
while ( rc & & * pos ) {
rc = listening_get_next ( seq , rc ) ;
- - * pos ;
}
return rc ;
}
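The three listening_get_*() helpers above share one cursor, st->bucket plus st->offset, and walk a bucketed hash table by skipping empty buckets. A simplified userspace sketch of the same traversal pattern follows. It is an illustration under stated assumptions: the table, node, and cursor types are hypothetical, a plain singly linked list stands in for the nulls list, and the per-bucket locking and the seq_sk_match() filter are left out.

```c
#include <stddef.h>

#define NBUCKETS 4 /* hypothetical table size */

struct node {
	int key;
	struct node *next;
};

struct table {
	struct node *bucket[NBUCKETS];
};

/* Plays the role of st->bucket and st->offset. */
struct cursor {
	unsigned int bucket;
	unsigned int offset;
};

/* First node at or after cur->bucket, skipping empty buckets. */
static struct node *get_first(const struct table *t, struct cursor *cur)
{
	cur->offset = 0;
	for (; cur->bucket < NBUCKETS; cur->bucket++) {
		if (t->bucket[cur->bucket])
			return t->bucket[cur->bucket];
	}
	return NULL;
}

/* Successor of n: same bucket if possible, else the next non-empty bucket. */
static struct node *get_next(const struct table *t, struct cursor *cur,
			     struct node *n)
{
	cur->offset++;
	if (n->next)
		return n->next;
	cur->bucket++;
	return get_first(t, cur);
}

/* Node at absolute position pos, counted from the start of the table. */
static struct node *get_idx(const struct table *t, struct cursor *cur,
			    unsigned long pos)
{
	struct node *n;

	cur->bucket = 0;
	n = get_first(t, cur);
	while (n && pos) {
		n = get_next(t, cur, n);
		pos--;
	}
	return n;
}
```

Because the cursor records both the bucket and the offset within it, a later call can resume near the previous position instead of rescanning from bucket 0, which is the idea tcp_seek_last_pos() builds on.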
2022-09-07 18:10:20 -07:00
static inline bool empty_bucket ( struct inet_hashinfo * hinfo ,
const struct tcp_iter_state * st )
2008-08-28 01:08:02 -07:00
{
2022-09-07 18:10:20 -07:00
return hlist_nulls_empty ( & hinfo - > ehash [ st - > bucket ] . chain ) ;
2008-08-28 01:08:02 -07:00
}
2010-06-07 00:43:42 -07:00
/*
* Get first established socket starting from bucket given in st - > bucket .
* If st - > bucket is zero , the very first socket in the hash is returned .
*/
2005-04-16 15:20:36 -07:00
static void * established_get_first ( struct seq_file * seq )
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st = seq - > private ;
2020-06-23 16:08:04 -07:00
2010-06-07 00:43:42 -07:00
st - > offset = 0 ;
2022-09-07 18:10:20 -07:00
for ( ; st - > bucket < = hinfo - > ehash_mask ; + + st - > bucket ) {
2005-04-16 15:20:36 -07:00
struct sock * sk ;
2008-11-16 19:40:17 -08:00
struct hlist_nulls_node * node ;
2022-09-07 18:10:20 -07:00
spinlock_t * lock = inet_ehash_lockp ( hinfo , st - > bucket ) ;
2005-04-16 15:20:36 -07:00
2008-08-28 01:08:02 -07:00
/* Lockless fast path for the common case of empty buckets */
2022-09-07 18:10:20 -07:00
if ( empty_bucket ( hinfo , st ) )
2008-08-28 01:08:02 -07:00
continue ;
2008-11-20 20:39:09 -08:00
spin_lock_bh ( lock ) ;
2022-09-07 18:10:20 -07:00
sk_nulls_for_each ( sk , node , & hinfo - > ehash [ st - > bucket ] . chain ) {
2021-07-01 13:05:48 -07:00
if ( seq_sk_match ( seq , sk ) )
return sk ;
2005-04-16 15:20:36 -07:00
}
2008-11-20 20:39:09 -08:00
spin_unlock_bh ( lock ) ;
2005-04-16 15:20:36 -07:00
}
2021-07-01 13:05:48 -07:00
return NULL ;
2005-04-16 15:20:36 -07:00
}
static void * established_get_next ( struct seq_file * seq , void * cur )
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st = seq - > private ;
2022-09-07 18:10:17 -07:00
struct hlist_nulls_node * node ;
struct sock * sk = cur ;
2020-06-23 16:08:04 -07:00
2005-04-16 15:20:36 -07:00
+ + st - > num ;
2010-06-07 00:43:42 -07:00
+ + st - > offset ;
2005-04-16 15:20:36 -07:00
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 00:22:02 -07:00
sk = sk_nulls_next ( sk ) ;
2005-04-16 15:20:36 -07:00
2008-11-16 19:40:17 -08:00
sk_nulls_for_each_from ( sk , node ) {
2021-07-01 13:05:48 -07:00
if ( seq_sk_match ( seq , sk ) )
2013-10-03 00:22:02 -07:00
return sk ;
2005-04-16 15:20:36 -07:00
}
2022-09-07 18:10:20 -07:00
spin_unlock_bh ( inet_ehash_lockp ( hinfo , st - > bucket ) ) ;
2013-10-03 00:22:02 -07:00
+ + st - > bucket ;
return established_get_first ( seq ) ;
2005-04-16 15:20:36 -07:00
}
static void * established_get_idx ( struct seq_file * seq , loff_t pos )
{
2010-06-07 00:43:42 -07:00
struct tcp_iter_state * st = seq - > private ;
void * rc ;
st - > bucket = 0 ;
rc = established_get_first ( seq ) ;
2005-04-16 15:20:36 -07:00
while ( rc & & pos ) {
rc = established_get_next ( seq , rc ) ;
- - pos ;
2006-11-17 10:57:30 -02:00
}
2005-04-16 15:20:36 -07:00
return rc ;
}
static void * tcp_get_idx ( struct seq_file * seq , loff_t pos )
{
void * rc ;
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st = seq - > private ;
2005-04-16 15:20:36 -07:00
st - > state = TCP_SEQ_STATE_LISTENING ;
rc = listening_get_idx ( seq , & pos ) ;
if ( ! rc ) {
st - > state = TCP_SEQ_STATE_ESTABLISHED ;
rc = established_get_idx ( seq , pos ) ;
}
return rc ;
}
2010-06-07 00:43:42 -07:00
static void * tcp_seek_last_pos ( struct seq_file * seq )
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2010-06-07 00:43:42 -07:00
struct tcp_iter_state * st = seq - > private ;
2021-07-01 13:05:41 -07:00
int bucket = st - > bucket ;
2010-06-07 00:43:42 -07:00
int offset = st - > offset ;
int orig_num = st - > num ;
void * rc = NULL ;
switch ( st - > state ) {
case TCP_SEQ_STATE_LISTENING :
2022-09-07 18:10:20 -07:00
if ( st - > bucket > hinfo - > lhash2_mask )
2010-06-07 00:43:42 -07:00
break ;
2021-07-01 13:06:00 -07:00
rc = listening_get_first ( seq ) ;
2021-07-01 13:05:41 -07:00
while ( offset - - & & rc & & bucket = = st - > bucket )
2010-06-07 00:43:42 -07:00
rc = listening_get_next ( seq , rc ) ;
if ( rc )
break ;
st - > bucket = 0 ;
2013-10-03 00:22:02 -07:00
st - > state = TCP_SEQ_STATE_ESTABLISHED ;
2020-03-12 15:50:22 -07:00
fallthrough ;
2010-06-07 00:43:42 -07:00
case TCP_SEQ_STATE_ESTABLISHED :
2022-09-07 18:10:20 -07:00
if ( st - > bucket > hinfo - > ehash_mask )
2010-06-07 00:43:42 -07:00
break ;
rc = established_get_first ( seq ) ;
2021-07-01 13:05:41 -07:00
while ( offset - - & & rc & & bucket = = st - > bucket )
2010-06-07 00:43:42 -07:00
rc = established_get_next ( seq , rc ) ;
}
st - > num = orig_num ;
return rc ;
}
2018-04-11 09:31:28 +02:00
void * tcp_seq_start ( struct seq_file * seq , loff_t * pos )
2005-04-16 15:20:36 -07:00
{
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st = seq - > private ;
2010-06-07 00:43:42 -07:00
void * rc ;
if ( * pos & & * pos = = st - > last_pos ) {
rc = tcp_seek_last_pos ( seq ) ;
if ( rc )
goto out ;
}
2005-04-16 15:20:36 -07:00
st - > state = TCP_SEQ_STATE_LISTENING ;
st - > num = 0 ;
2010-06-07 00:43:42 -07:00
st - > bucket = 0 ;
st - > offset = 0 ;
rc = * pos ? tcp_get_idx ( seq , * pos - 1 ) : SEQ_START_TOKEN ;
out :
st - > last_pos = * pos ;
return rc ;
2005-04-16 15:20:36 -07:00
}
2018-04-11 09:31:28 +02:00
EXPORT_SYMBOL ( tcp_seq_start ) ;
2005-04-16 15:20:36 -07:00
2018-04-11 09:31:28 +02:00
void * tcp_seq_next ( struct seq_file * seq , void * v , loff_t * pos )
2005-04-16 15:20:36 -07:00
{
2010-06-07 00:43:42 -07:00
struct tcp_iter_state * st = seq - > private ;
2005-04-16 15:20:36 -07:00
void * rc = NULL ;
if ( v = = SEQ_START_TOKEN ) {
rc = tcp_get_idx ( seq , 0 ) ;
goto out ;
}
switch ( st - > state ) {
case TCP_SEQ_STATE_LISTENING :
rc = listening_get_next ( seq , v ) ;
if ( ! rc ) {
st - > state = TCP_SEQ_STATE_ESTABLISHED ;
2010-06-07 00:43:42 -07:00
st - > bucket = 0 ;
st - > offset = 0 ;
2005-04-16 15:20:36 -07:00
rc = established_get_first ( seq ) ;
}
break ;
case TCP_SEQ_STATE_ESTABLISHED :
rc = established_get_next ( seq , v ) ;
break ;
}
out :
+ + * pos ;
2010-06-07 00:43:42 -07:00
st - > last_pos = * pos ;
2005-04-16 15:20:36 -07:00
return rc ;
}
2018-04-11 09:31:28 +02:00
EXPORT_SYMBOL ( tcp_seq_next ) ;
2005-04-16 15:20:36 -07:00
2018-04-11 09:31:28 +02:00
void tcp_seq_stop ( struct seq_file * seq , void * v )
2005-04-16 15:20:36 -07:00
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st = seq - > private ;
2005-04-16 15:20:36 -07:00
switch ( st - > state ) {
case TCP_SEQ_STATE_LISTENING :
if ( v ! = SEQ_START_TOKEN )
2022-09-07 18:10:20 -07:00
spin_unlock ( & hinfo - > lhash2 [ st - > bucket ] . lock ) ;
2005-04-16 15:20:36 -07:00
break ;
case TCP_SEQ_STATE_ESTABLISHED :
if ( v )
2022-09-07 18:10:20 -07:00
spin_unlock_bh ( inet_ehash_lockp ( hinfo , st - > bucket ) ) ;
2005-04-16 15:20:36 -07:00
break ;
}
}
2018-04-11 09:31:28 +02:00
EXPORT_SYMBOL ( tcp_seq_stop ) ;
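tcp_seq_start() and tcp_seq_next() map file positions to records with an off-by-one: position 0 yields SEQ_START_TOKEN (the header line) and position N yields record N - 1. The sketch below models only that mapping in userspace. The item list, the START_TOKEN stand-in, and every demo_* name are hypothetical; the real seq_file interface, the locking, and the last_pos shortcut are not reproduced.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define START_TOKEN ((void *)1) /* stand-in for SEQ_START_TOKEN */

static const char *items[] = { "listen-a", "estab-b", "estab-c" };
#define NITEMS (sizeof(items) / sizeof(items[0]))

/* Record at index idx, or NULL when idx is out of range. */
static void *item_at(long idx)
{
	return (idx >= 0 && (size_t)idx < NITEMS) ? (void *)&items[idx] : NULL;
}

/* Position 0 is the header token; position N is record N - 1. */
static void *demo_start(long *pos)
{
	return *pos ? item_at(*pos - 1) : START_TOKEN;
}

/* Advance the position and return the record that now corresponds to it. */
static void *demo_next(void *v, long *pos)
{
	(void)v; /* the real iterator continues from v; this model recomputes */
	++*pos;
	return item_at(*pos - 1);
}

/* Drive the iterator the way a reader would, writing one line per step. */
static size_t demo_dump(char *buf, size_t len)
{
	long pos = 0;
	size_t used = 0;
	void *v;

	for (v = demo_start(&pos); v && used < len; v = demo_next(v, &pos)) {
		if (v == START_TOKEN)
			used += snprintf(buf + used, len - used, "header\n");
		else
			used += snprintf(buf + used, len - used, "%s\n",
					 *(const char **)v);
	}
	return used;
}
```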
2005-04-16 15:20:36 -07:00
2015-03-12 16:44:09 -07:00
static void get_openreq4 ( const struct request_sock * req ,
2015-10-02 11:43:30 -07:00
struct seq_file * f , int i )
2005-04-16 15:20:36 -07:00
{
[NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basically tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:46:52 -07:00
const struct inet_request_sock * ireq = inet_rsk ( req ) ;
inet: get rid of central tcp/dccp listener timer
One of the major issues for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.
This function runs for awfully long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundreds of ms.
SYNACK are sent in huge bursts, likely to cause severe drops anyway.
This model was OK 15 years ago when memory was very tight.
We now can afford to have a timer per request sock.
Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.
With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.
This is ~100 times more than what we could achieve before this patch.
Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 19:04:20 -07:00
long delta = req - > rsk_timer . expires - jiffies ;
2005-04-16 15:20:36 -07:00
2008-04-24 01:02:16 -07:00
seq_printf ( f , " %4d: %08X:%04X %08X:%04X "
2013-11-14 14:31:57 -08:00
" %02X %08X:%08X %02X:%08lX %08X %5u %8d %u %d %pK " ,
2005-04-16 15:20:36 -07:00
i ,
2013-10-09 15:21:29 -07:00
ireq - > ir_loc_addr ,
2015-03-12 16:44:09 -07:00
ireq - > ir_num ,
2013-10-09 15:21:29 -07:00
ireq - > ir_rmt_addr ,
ntohs ( ireq - > ir_rmt_port ) ,
2005-04-16 15:20:36 -07:00
TCP_SYN_RECV ,
0 , 0 , /* could print option size, but that is af dependent. */
1 , /* timers active (only the expire timer) */
2012-08-08 21:13:53 +00:00
jiffies_delta_to_clock_t ( delta ) ,
2012-10-27 23:16:46 +00:00
req - > num_timeout ,
2015-10-02 11:43:30 -07:00
from_kuid_munged ( seq_user_ns ( f ) ,
sock_i_uid ( req - > rsk_listener ) ) ,
2005-04-16 15:20:36 -07:00
0 , /* non standard timer */
0 , /* open_requests have no inode */
2015-03-12 16:44:09 -07:00
0 ,
2013-11-14 14:31:57 -08:00
req ) ;
2005-04-16 15:20:36 -07:00
}
2013-11-14 14:31:57 -08:00
static void get_tcp4_sock ( struct sock * sk , struct seq_file * f , int i )
2005-04-16 15:20:36 -07:00
{
int timer_active ;
unsigned long timer_expires ;
2011-10-21 05:22:42 -04:00
const struct tcp_sock * tp = tcp_sk ( sk ) ;
2007-02-22 01:13:58 -08:00
const struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2011-10-21 05:22:42 -04:00
const struct inet_sock * inet = inet_sk ( sk ) ;
2015-09-29 07:42:52 -07:00
const struct fastopen_queue * fastopenq = & icsk - > icsk_accept_queue . fastopenq ;
2009-10-15 06:30:45 +00:00
__be32 dest = inet - > inet_daddr ;
__be32 src = inet - > inet_rcv_saddr ;
__u16 destp = ntohs ( inet - > inet_dport ) ;
__u16 srcp = ntohs ( inet - > inet_sport ) ;
2009-12-03 16:06:13 -08:00
int rx_queue ;
2015-11-12 08:43:18 -08:00
int state ;
2005-04-16 15:20:36 -07:00
tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-11 10:00:43 +00:00
if ( icsk - > icsk_pending = = ICSK_TIME_RETRANS | |
2017-01-12 22:11:33 -08:00
icsk - > icsk_pending = = ICSK_TIME_REO_TIMEOUT | |
2013-03-11 10:00:43 +00:00
icsk - > icsk_pending = = ICSK_TIME_LOSS_PROBE ) {
2005-04-16 15:20:36 -07:00
timer_active = 1 ;
2005-08-09 20:10:42 -07:00
timer_expires = icsk - > icsk_timeout ;
} else if ( icsk - > icsk_pending = = ICSK_TIME_PROBE0 ) {
2005-04-16 15:20:36 -07:00
timer_active = 4 ;
2005-08-09 20:10:42 -07:00
timer_expires = icsk - > icsk_timeout ;
2007-02-22 01:13:58 -08:00
} else if ( timer_pending ( & sk - > sk_timer ) ) {
2005-04-16 15:20:36 -07:00
timer_active = 2 ;
2007-02-22 01:13:58 -08:00
timer_expires = sk - > sk_timer . expires ;
2005-04-16 15:20:36 -07:00
} else {
timer_active = 0 ;
timer_expires = jiffies ;
}
2017-12-20 11:12:52 +08:00
state = inet_sk_state_load ( sk ) ;
2015-11-12 08:43:18 -08:00
if ( state = = TCP_LISTEN )
2019-11-05 14:11:53 -08:00
rx_queue = READ_ONCE ( sk - > sk_ack_backlog ) ;
2009-12-03 16:06:13 -08:00
else
2015-11-12 08:43:18 -08:00
/* Because we don't lock the socket,
 * we might find a transient negative value.
2009-12-03 16:06:13 -08:00
*/
2019-10-10 20:17:39 -07:00
rx_queue = max_t ( int , READ_ONCE ( tp - > rcv_nxt ) -
2019-10-10 20:17:40 -07:00
READ_ONCE ( tp - > copied_seq ) , 0 ) ;
2009-12-03 16:06:13 -08:00
2008-04-24 01:02:16 -07:00
seq_printf ( f , " %4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
2013-11-14 14:31:57 -08:00
" %08X %5u %8d %lu %d %pK %lu %lu %u %u %d " ,
2015-11-12 08:43:18 -08:00
i , src , srcp , dest , destp , state ,
2019-10-10 20:17:41 -07:00
READ_ONCE ( tp - > write_seq ) - tp - > snd_una ,
2009-12-03 16:06:13 -08:00
rx_queue ,
2005-04-16 15:20:36 -07:00
timer_active ,
2012-08-08 21:13:53 +00:00
jiffies_delta_to_clock_t ( timer_expires - jiffies ) ,
2005-08-09 20:10:42 -07:00
icsk - > icsk_retransmits ,
2012-05-24 01:10:10 -06:00
from_kuid_munged ( seq_user_ns ( f ) , sock_i_uid ( sk ) ) ,
2005-08-10 04:03:31 -03:00
icsk - > icsk_probes_out ,
2007-02-22 01:13:58 -08:00
sock_i_ino ( sk ) ,
2017-06-30 13:08:01 +03:00
refcount_read ( & sk - > sk_refcnt ) , sk ,
2008-06-27 20:00:19 -07:00
jiffies_to_clock_t ( icsk - > icsk_rto ) ,
jiffies_to_clock_t ( icsk - > icsk_ack . ato ) ,
2019-01-25 10:53:19 -08:00
( icsk - > icsk_ack . quick < < 1 ) | inet_csk_in_pingpong_mode ( sk ) ,
2022-04-05 16:35:38 -07:00
tcp_snd_cwnd ( tp ) ,
2015-11-12 08:43:18 -08:00
state = = TCP_LISTEN ?
fastopenq - > max_qlen :
2013-11-14 14:31:57 -08:00
( tcp_in_initial_slowstart ( tp ) ? - 1 : tp - > snd_ssthresh ) ) ;
2005-04-16 15:20:36 -07:00
}
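The probe timeout (PTO) rules listed in the Tail loss probe commit message quoted inside get_tcp4_sock() can be sketched as plain arithmetic. This is a minimal userspace illustration, not the kernel's implementation: the function name `tlp_pto_ms`, the helper names, and the millisecond units are assumptions made for the example, and the 1.5*RTT term uses truncating integer division.

```c
#include <stdint.h>

/* Hypothetical helpers for this sketch; these are not kernel functions. */
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Compute the probe timeout in milliseconds from the rules in the
 * commit message:
 *   packets_out > 1:  PTO = max(2*SRTT, 10ms)
 *   packets_out == 1: PTO = max(2*RTT, 1.5*RTT + 200ms)
 *   PTO = min(PTO, RTO)
 * The message does not cover packets_out == 0; this sketch treats it
 * like the packets_out > 1 case.
 */
uint32_t tlp_pto_ms(uint32_t srtt_ms, uint32_t packets_out, uint32_t rto_ms)
{
	uint32_t pto;

	if (packets_out == 1)
		/* Leave room for the peer's delayed ACK timer. */
		pto = max_u32(2 * srtt_ms, srtt_ms + srtt_ms / 2 + 200);
	else
		pto = max_u32(2 * srtt_ms, 10);

	/* Never schedule the probe later than the retransmission timeout. */
	return min_u32(pto, rto_ms);
}
```

With a smoothed RTT of 100 ms and an RTO of 1000 ms, two or more outstanding packets give a PTO of 200 ms, while a single outstanding packet gives 350 ms because of the delayed-ACK allowance.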
2011-10-21 05:22:42 -04:00
static void get_timewait4_sock ( const struct inet_timewait_sock * tw ,
2013-11-14 14:31:57 -08:00
struct seq_file * f , int i )
2005-04-16 15:20:36 -07:00
{
tcp/dccp: get rid of central timewait timer
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.
This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)
We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.
Tested:
On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)
Before patch :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171
While test is running, we can observe 25 or even 33 ms latencies.
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
After patch :
About 90% increase of throughput :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992
And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 18:51:09 -07:00
long delta = tw - > tw_timer . expires - jiffies ;
2006-09-27 18:43:50 -07:00
__be32 dest , src ;
2005-04-16 15:20:36 -07:00
__u16 destp , srcp ;
dest = tw - > tw_daddr ;
src = tw - > tw_rcv_saddr ;
destp = ntohs ( tw - > tw_dport ) ;
srcp = ntohs ( tw - > tw_sport ) ;
2008-04-24 01:02:16 -07:00
seq_printf ( f , " %4d: %08X:%04X %08X:%04X "
2013-11-14 14:31:57 -08:00
" %02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK " ,
2005-04-16 15:20:36 -07:00
i , src , srcp , dest , destp , tw - > tw_substate , 0 , 0 ,
2012-08-08 21:13:53 +00:00
3 , jiffies_delta_to_clock_t ( delta ) , 0 , 0 , 0 , 0 ,
2017-06-30 13:08:01 +03:00
refcount_read ( & tw - > tw_refcnt ) , tw ) ;
2005-04-16 15:20:36 -07:00
}
# define TMPSZ 150
static int tcp4_seq_show ( struct seq_file * seq , void * v )
{
2008-11-03 02:49:10 -08:00
struct tcp_iter_state * st ;
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 00:22:02 -07:00
struct sock * sk = v ;
2005-04-16 15:20:36 -07:00
2013-11-14 14:31:57 -08:00
seq_setwidth ( seq , TMPSZ - 1 ) ;
2005-04-16 15:20:36 -07:00
if ( v = = SEQ_START_TOKEN ) {
2013-11-14 14:31:57 -08:00
seq_puts ( seq , " sl local_address rem_address st tx_queue "
2005-04-16 15:20:36 -07:00
" rx_queue tr tm->when retrnsmt uid timeout "
" inode " ) ;
goto out ;
}
st = seq - > private ;
2015-10-02 11:43:32 -07:00
if ( sk - > sk_state = = TCP_TIME_WAIT )
get_timewait4_sock ( v , seq , st - > num ) ;
else if ( sk - > sk_state = = TCP_NEW_SYN_RECV )
2015-10-02 11:43:30 -07:00
get_openreq4 ( v , seq , st - > num ) ;
2015-10-02 11:43:32 -07:00
else
get_tcp4_sock ( v , seq , st - > num ) ;
2005-04-16 15:20:36 -07:00
out :
2013-11-14 14:31:57 -08:00
seq_pad ( seq , ' \n ' ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
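The rows emitted by tcp4_seq_show() follow the column order of its header line (sl, local_address, rem_address, st, tx_queue, rx_queue, tr, tm->when, retrnsmt, uid, timeout, inode). The sketch below shows how a userspace reader might parse the leading columns of one such row with `sscanf`. The struct and function names (`tcp4_row`, `parse_tcp4_row`, `tcp4_timer_name`) are hypothetical, addresses are kept as the raw hexadecimal values the kernel prints, and columns after the inode are ignored. The timer codes mirror the values assigned to `timer_active` in get_tcp4_sock() and the constant 3 passed by get_timewait4_sock().

```c
#include <stdio.h>

/* Leading columns of one /proc/net/tcp row (hypothetical layout for this sketch). */
struct tcp4_row {
	int slot;
	unsigned int local_addr, local_port;
	unsigned int remote_addr, remote_port;
	unsigned int state;
	unsigned int tx_queue, rx_queue;
	unsigned int timer_active;
	unsigned long timer_expires;
	unsigned int retransmits;
	unsigned int uid;
	int probes_out;
	unsigned long inode;
};

/* Returns 0 when all 14 leading fields were parsed, -1 otherwise. */
int parse_tcp4_row(const char *line, struct tcp4_row *r)
{
	int n = sscanf(line, "%d: %X:%X %X:%X %X %X:%X %X:%lX %X %u %d %lu",
		       &r->slot,
		       &r->local_addr, &r->local_port,
		       &r->remote_addr, &r->remote_port,
		       &r->state,
		       &r->tx_queue, &r->rx_queue,
		       &r->timer_active, &r->timer_expires,
		       &r->retransmits, &r->uid, &r->probes_out, &r->inode);

	return n == 14 ? 0 : -1;
}

/* Map the "tr" column to a label, following the codes used in this file. */
const char *tcp4_timer_name(unsigned int code)
{
	switch (code) {
	case 0: return "none";
	case 1: return "retransmit, reordering timeout or loss probe";
	case 2: return "sk_timer pending";
	case 3: return "time-wait";
	case 4: return "zero window probe";
	default: return "unknown";
	}
}
```

A caller would read /proc/net/tcp line by line, skip the header row, and pass each remaining line to `parse_tcp4_row()`.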
2020-06-23 16:08:05 -07:00
# ifdef CONFIG_BPF_SYSCALL
2021-07-01 13:06:13 -07:00
struct bpf_tcp_iter_state {
struct tcp_iter_state state ;
unsigned int cur_sk ;
unsigned int end_sk ;
unsigned int max_sk ;
struct sock * * batch ;
bool st_bucket_done ;
} ;
2020-06-23 16:08:05 -07:00
struct bpf_iter__tcp {
__bpf_md_ptr ( struct bpf_iter_meta * , meta ) ;
__bpf_md_ptr ( struct sock_common * , sk_common ) ;
uid_t uid __aligned ( 8 ) ;
} ;
static int tcp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
			     struct sock_common *sk_common, uid_t uid)
{
	struct bpf_iter__tcp ctx;

	meta->seq_num--;  /* skip SEQ_START_TOKEN */
	ctx.meta = meta;
	ctx.sk_common = sk_common;
	ctx.uid = uid;
	return bpf_iter_run_prog(prog, &ctx);
}
2021-07-01 13:06:13 -07:00
static void bpf_iter_tcp_put_batch ( struct bpf_tcp_iter_state * iter )
{
while ( iter - > cur_sk < iter - > end_sk )
2023-03-27 17:42:32 -07:00
sock_gen_put ( iter - > batch [ iter - > cur_sk + + ] ) ;
2021-07-01 13:06:13 -07:00
}
static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
				      unsigned int new_batch_sz)
{
	struct sock **new_batch;

	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz,
			     GFP_USER | __GFP_NOWARN);
	if (!new_batch)
		return -ENOMEM;

	bpf_iter_tcp_put_batch(iter);
	kvfree(iter->batch);
	iter->batch = new_batch;
	iter->max_sk = new_batch_sz;

	return 0;
}
static unsigned int bpf_iter_tcp_listening_batch ( struct seq_file * seq ,
struct sock * start_sk )
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2021-07-01 13:06:13 -07:00
struct bpf_tcp_iter_state * iter = seq - > private ;
struct tcp_iter_state * st = & iter - > state ;
2022-05-11 17:06:05 -07:00
struct hlist_nulls_node * node ;
2021-07-01 13:06:13 -07:00
unsigned int expected = 1 ;
struct sock * sk ;
sock_hold ( start_sk ) ;
iter - > batch [ iter - > end_sk + + ] = start_sk ;
2022-05-11 17:06:05 -07:00
sk = sk_nulls_next ( start_sk ) ;
sk_nulls_for_each_from ( sk , node ) {
2021-07-01 13:06:13 -07:00
if ( seq_sk_match ( seq , sk ) ) {
if ( iter - > end_sk < iter - > max_sk ) {
sock_hold ( sk ) ;
iter - > batch [ iter - > end_sk + + ] = sk ;
}
expected + + ;
}
}
2022-09-07 18:10:20 -07:00
spin_unlock ( & hinfo - > lhash2 [ st - > bucket ] . lock ) ;
2021-07-01 13:06:13 -07:00
return expected ;
}
static unsigned int bpf_iter_tcp_established_batch ( struct seq_file * seq ,
struct sock * start_sk )
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2021-07-01 13:06:13 -07:00
struct bpf_tcp_iter_state * iter = seq - > private ;
struct tcp_iter_state * st = & iter - > state ;
struct hlist_nulls_node * node ;
unsigned int expected = 1 ;
struct sock * sk ;
sock_hold ( start_sk ) ;
iter - > batch [ iter - > end_sk + + ] = start_sk ;
sk = sk_nulls_next ( start_sk ) ;
sk_nulls_for_each_from ( sk , node ) {
if ( seq_sk_match ( seq , sk ) ) {
if ( iter - > end_sk < iter - > max_sk ) {
sock_hold ( sk ) ;
iter - > batch [ iter - > end_sk + + ] = sk ;
}
expected + + ;
}
}
2022-09-07 18:10:20 -07:00
spin_unlock_bh ( inet_ehash_lockp ( hinfo , st - > bucket ) ) ;
2021-07-01 13:06:13 -07:00
return expected ;
}
static struct sock * bpf_iter_tcp_batch ( struct seq_file * seq )
{
2022-09-07 18:10:20 -07:00
struct inet_hashinfo * hinfo = seq_file_net ( seq ) - > ipv4 . tcp_death_row . hashinfo ;
2021-07-01 13:06:13 -07:00
struct bpf_tcp_iter_state * iter = seq - > private ;
struct tcp_iter_state * st = & iter - > state ;
unsigned int expected ;
bool resized = false ;
struct sock * sk ;
/* The st->bucket is done. Directly advance to the next
 * bucket instead of having the tcp_seek_last_pos() to skip
 * one by one in the current bucket and eventually find out
 * it has to advance to the next bucket.
 */
if ( iter - > st_bucket_done ) {
st - > offset = 0 ;
st - > bucket + + ;
if ( st - > state = = TCP_SEQ_STATE_LISTENING & &
2022-09-07 18:10:20 -07:00
st - > bucket > hinfo - > lhash2_mask ) {
2021-07-01 13:06:13 -07:00
st - > state = TCP_SEQ_STATE_ESTABLISHED ;
st - > bucket = 0 ;
}
}
again :
/* Get a new batch */
iter - > cur_sk = 0 ;
iter - > end_sk = 0 ;
iter - > st_bucket_done = false ;
sk = tcp_seek_last_pos ( seq ) ;
if ( ! sk )
return NULL ; /* Done */
if ( st - > state = = TCP_SEQ_STATE_LISTENING )
expected = bpf_iter_tcp_listening_batch ( seq , sk ) ;
else
expected = bpf_iter_tcp_established_batch ( seq , sk ) ;
if ( iter - > end_sk = = expected ) {
iter - > st_bucket_done = true ;
return sk ;
}
if ( ! resized & & ! bpf_iter_tcp_realloc_batch ( iter , expected * 3 / 2 ) ) {
resized = true ;
goto again ;
}
return sk ;
}
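bpf_iter_tcp_batch() follows a count-then-grow pattern: fill the batch up to its capacity while counting every match, and if the batch was too small, reallocate it to 1.5 times the expected count and retry exactly once. The sketch below reproduces that control flow in userspace over a plain integer array. It is only an analogy under stated assumptions: `struct batch`, `fill_batch`, `realloc_batch` and `collect_bucket` are hypothetical names, "matching" means "even value", and there is no locking or reference counting. In the kernel the bucket lock is dropped between attempts, so the bucket can grow again, which is why the second attempt may still return a partial batch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the iterator's batch array. */
struct batch {
	int *items;
	size_t end;	/* number of stored items */
	size_t max;	/* capacity of items[] */
};

/* Store matching (even) values up to capacity; return the total match count. */
static size_t fill_batch(struct batch *b, const int *src, size_t n)
{
	size_t expected = 0;

	b->end = 0;
	for (size_t i = 0; i < n; i++) {
		if (src[i] % 2 != 0)
			continue;
		if (b->end < b->max)
			b->items[b->end++] = src[i];
		expected++;
	}
	return expected;
}

/* Replace the batch array with a larger one; old contents are discarded. */
static int realloc_batch(struct batch *b, size_t new_max)
{
	int *items = malloc(sizeof(*items) * new_max);

	if (!items)
		return -1;
	free(b->items);
	b->items = items;
	b->max = new_max;
	return 0;
}

/* Returns true when every matching item of the bucket fits in the batch. */
bool collect_bucket(struct batch *b, const int *src, size_t n)
{
	bool resized = false;
	size_t expected;

again:
	expected = fill_batch(b, src, n);
	if (b->end == expected)
		return true;

	/* Grow once with 50% headroom, then retry from the start. */
	if (!resized && !realloc_batch(b, expected * 3 / 2)) {
		resized = true;
		goto again;
	}
	return false;	/* partial batch: caller resumes later */
}
```

Retrying only once bounds the work done per call; a partial batch is still usable because the caller can resume from the recorded offset on the next pass.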
static void *bpf_iter_tcp_seq_start(struct seq_file *seq, loff_t *pos)
{
	/* bpf iter does not support lseek, so it always
	 * continues from where it was stop()-ped.
	 */
	if (*pos)
		return bpf_iter_tcp_batch(seq);

	return SEQ_START_TOKEN;
}
static void * bpf_iter_tcp_seq_next ( struct seq_file * seq , void * v , loff_t * pos )
{
struct bpf_tcp_iter_state * iter = seq - > private ;
struct tcp_iter_state * st = & iter - > state ;
struct sock * sk ;
/* Whenever seq_next() is called, the iter->cur_sk is
* done with seq_show ( ) , so advance to the next sk in
* the batch .
*/
if ( iter - > cur_sk < iter - > end_sk ) {
/* Keeping st->num consistent in tcp_iter_state.
* bpf_iter_tcp does not use st - > num .
* meta . seq_num is used instead .
*/
st - > num + + ;
/* Move st->offset to the next sk in the bucket such that
* the future start ( ) will resume at st - > offset in
* st - > bucket . See tcp_seek_last_pos ( ) .
*/
st - > offset + + ;
2023-03-27 17:42:32 -07:00
sock_gen_put ( iter - > batch [ iter - > cur_sk + + ] ) ;
2021-07-01 13:06:13 -07:00
}
if ( iter - > cur_sk < iter - > end_sk )
sk = iter - > batch [ iter - > cur_sk ] ;
else
sk = bpf_iter_tcp_batch ( seq ) ;
+ + * pos ;
/* Keeping st->last_pos consistent in tcp_iter_state.
 * bpf iter does not do lseek, so st->last_pos always equals *pos.
 */
st - > last_pos = * pos ;
return sk ;
}
2020-06-23 16:08:05 -07:00
static int bpf_iter_tcp_seq_show ( struct seq_file * seq , void * v )
{
struct bpf_iter_meta meta ;
struct bpf_prog * prog ;
struct sock * sk = v ;
uid_t uid ;
2021-07-01 13:06:13 -07:00
int ret ;
2020-06-23 16:08:05 -07:00
if ( v = = SEQ_START_TOKEN )
return 0 ;
2021-07-01 13:06:13 -07:00
if ( sk_fullsock ( sk ) )
2023-05-19 22:51:49 +00:00
lock_sock ( sk ) ;
2021-07-01 13:06:13 -07:00
if ( unlikely ( sk_unhashed ( sk ) ) ) {
ret = SEQ_SKIP ;
goto unlock ;
}
2020-06-23 16:08:05 -07:00
if ( sk - > sk_state = = TCP_TIME_WAIT ) {
uid = 0 ;
} else if ( sk - > sk_state = = TCP_NEW_SYN_RECV ) {
const struct request_sock * req = v ;
uid = from_kuid_munged ( seq_user_ns ( seq ) ,
sock_i_uid ( req - > rsk_listener ) ) ;
} else {
uid = from_kuid_munged ( seq_user_ns ( seq ) , sock_i_uid ( sk ) ) ;
}
meta . seq = seq ;
prog = bpf_iter_get_info ( & meta , false ) ;
2021-07-01 13:06:13 -07:00
ret = tcp_prog_seq_show ( prog , & meta , v , uid ) ;
unlock :
if ( sk_fullsock ( sk ) )
2023-05-19 22:51:49 +00:00
release_sock ( sk ) ;
2021-07-01 13:06:13 -07:00
return ret ;
2020-06-23 16:08:05 -07:00
}
static void bpf_iter_tcp_seq_stop ( struct seq_file * seq , void * v )
{
2021-07-01 13:06:13 -07:00
struct bpf_tcp_iter_state * iter = seq - > private ;
2020-06-23 16:08:05 -07:00
struct bpf_iter_meta meta ;
struct bpf_prog * prog ;
if ( ! v ) {
meta . seq = seq ;
prog = bpf_iter_get_info ( & meta , true ) ;
if ( prog )
( void ) tcp_prog_seq_show ( prog , & meta , v , 0 ) ;
}
2021-07-01 13:06:13 -07:00
if ( iter - > cur_sk < iter - > end_sk ) {
bpf_iter_tcp_put_batch ( iter ) ;
iter - > st_bucket_done = false ;
}
2020-06-23 16:08:05 -07:00
}
static const struct seq_operations bpf_iter_tcp_seq_ops = {
. show = bpf_iter_tcp_seq_show ,
2021-07-01 13:06:13 -07:00
. start = bpf_iter_tcp_seq_start ,
. next = bpf_iter_tcp_seq_next ,
2020-06-23 16:08:05 -07:00
. stop = bpf_iter_tcp_seq_stop ,
} ;
# endif
2021-07-01 13:05:48 -07:00
static unsigned short seq_file_family ( const struct seq_file * seq )
{
2021-07-01 13:05:54 -07:00
const struct tcp_seq_afinfo * afinfo ;
2021-07-01 13:05:48 -07:00
2021-07-01 13:05:54 -07:00
# ifdef CONFIG_BPF_SYSCALL
2021-07-01 13:05:48 -07:00
/* Iterated from bpf_iter. Let the bpf prog filter instead. */
2021-07-01 13:05:54 -07:00
if ( seq - > op = = & bpf_iter_tcp_seq_ops )
2021-07-01 13:05:48 -07:00
return AF_UNSPEC ;
2020-06-23 16:08:05 -07:00
# endif
2021-07-01 13:05:48 -07:00
/* Iterated from proc fs */
2022-01-21 22:14:23 -08:00
afinfo = pde_data ( file_inode ( seq - > file ) ) ;
2021-07-01 13:05:48 -07:00
return afinfo - > family ;
}
2020-06-23 16:08:05 -07:00
2018-04-11 09:31:28 +02:00
static const struct seq_operations tcp4_seq_ops = {
. show = tcp4_seq_show ,
. start = tcp_seq_start ,
. next = tcp_seq_next ,
. stop = tcp_seq_stop ,
} ;
2005-04-16 15:20:36 -07:00
static struct tcp_seq_afinfo tcp4_seq_afinfo = {
. family = AF_INET ,
} ;
2010-01-17 03:35:32 +00:00
static int __net_init tcp4_proc_init_net ( struct net * net )
2008-03-24 14:56:02 -07:00
{
2018-04-10 19:42:55 +02:00
if ( ! proc_create_net_data ( " tcp " , 0444 , net - > proc_net , & tcp4_seq_ops ,
sizeof ( struct tcp_iter_state ) , & tcp4_seq_afinfo ) )
2018-04-11 09:31:28 +02:00
return - ENOMEM ;
return 0 ;
2008-03-24 14:56:02 -07:00
}
2010-01-17 03:35:32 +00:00
static void __net_exit tcp4_proc_exit_net ( struct net * net )
2008-03-24 14:56:02 -07:00
{
2018-04-11 09:31:28 +02:00
remove_proc_entry ( " tcp " , net - > proc_net ) ;
2008-03-24 14:56:02 -07:00
}
static struct pernet_operations tcp4_net_ops = {
. init = tcp4_proc_init_net ,
. exit = tcp4_proc_exit_net ,
} ;
2005-04-16 15:20:36 -07:00
int __init tcp4_proc_init ( void )
{
2008-03-24 14:56:02 -07:00
return register_pernet_subsys ( & tcp4_net_ops ) ;
2005-04-16 15:20:36 -07:00
}
void tcp4_proc_exit ( void )
{
2008-03-24 14:56:02 -07:00
unregister_pernet_subsys ( & tcp4_net_ops ) ;
2005-04-16 15:20:36 -07:00
}
# endif /* CONFIG_PROC_FS */
2020-11-13 07:08:08 -08:00
/* @wake is one when sk_stream_write_space() calls us.
 * This sends EPOLLOUT only if notsent_bytes is half the limit.
 * This mimics the strategy used in sock_def_write_space().
 */
bool tcp_stream_memory_free(const struct sock *sk, int wake)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	u32 notsent_bytes = READ_ONCE(tp->write_seq) -
			    READ_ONCE(tp->snd_nxt);

	return (notsent_bytes << wake) < tcp_notsent_lowat(tp);
}
EXPORT_SYMBOL(tcp_stream_memory_free);
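The comparison in tcp_stream_memory_free() relies on two small arithmetic details: the unsigned subtraction stays correct across sequence-number wraparound, and shifting by `wake` doubles the unsent byte count so that a wakeup is reported only once the unsent data has dropped below half the limit. The following is a standalone sketch of just that arithmetic. The function name `stream_memory_free` and its flat parameter list are assumptions for the example; the kernel version reads the values from `struct tcp_sock` and obtains the limit from tcp_notsent_lowat().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the check: write_seq and snd_nxt are 32-bit
 * sequence numbers, notsent_lowat is the limit in bytes, and wake is
 * 0 for an ordinary poll or 1 when called from the write-space path.
 */
bool stream_memory_free(uint32_t write_seq, uint32_t snd_nxt,
			uint32_t notsent_lowat, int wake)
{
	/* Modulo-2^32 subtraction gives the right distance even after wrap. */
	uint32_t notsent_bytes = write_seq - snd_nxt;

	/* wake == 1 doubles the count, i.e. compares against half the limit. */
	return (notsent_bytes << wake) < notsent_lowat;
}
```

For a limit of 131072 bytes, 100000 unsent bytes count as free space for an ordinary poll (wake 0) but not for a wakeup (wake 1), since 200000 exceeds the limit; 60000 unsent bytes pass both checks.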
2005-04-16 15:20:36 -07:00
struct proto tcp_prot = {
. name = " TCP " ,
. owner = THIS_MODULE ,
. close = tcp_close ,
2018-03-30 15:08:05 -07:00
. pre_connect = tcp_v4_pre_connect ,
2005-04-16 15:20:36 -07:00
. connect = tcp_v4_connect ,
. disconnect = tcp_disconnect ,
2005-08-09 20:10:42 -07:00
. accept = inet_csk_accept ,
2005-04-16 15:20:36 -07:00
. ioctl = tcp_ioctl ,
. init = tcp_v4_init_sock ,
. destroy = tcp_v4_destroy_sock ,
. shutdown = tcp_shutdown ,
. setsockopt = tcp_setsockopt ,
. getsockopt = tcp_getsockopt ,
2021-01-15 08:34:59 -08:00
. bpf_bypass_getsockopt = tcp_bpf_bypass_getsockopt ,
2017-01-09 16:55:12 +01:00
. keepalive = tcp_set_keepalive ,
2005-04-16 15:20:36 -07:00
. recvmsg = tcp_recvmsg ,
2010-07-10 20:41:55 +00:00
. sendmsg = tcp_sendmsg ,
2023-06-07 19:19:13 +01:00
. splice_eof = tcp_splice_eof ,
2005-04-16 15:20:36 -07:00
. backlog_rcv = tcp_v4_do_rcv ,
tcp: TCP Small Queues
This introduces TSQ (TCP Small Queues)
TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 05:50:31 +00:00
. release_cb = tcp_release_cb ,
[SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 04:06:04 -08:00
. hash = inet_hash ,
. unhash = inet_unhash ,
. get_port = inet_csk_get_port ,
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
call succeeds. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it doesn't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is weird, as the 'bind()' has already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fails because of
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), the socket is unbound, which
means that it can try to bind to another port.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06 21:20:20 +08:00
. put_port = inet_put_port ,
2021-03-30 19:32:31 -07:00
# ifdef CONFIG_BPF_SYSCALL
. psock_update_sk_prot = tcp_bpf_update_proto ,
# endif
2005-04-16 15:20:36 -07:00
. enter_memory_pressure = tcp_enter_memory_pressure ,
2017-06-07 13:29:12 -07:00
. leave_memory_pressure = tcp_leave_memory_pressure ,
tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.
TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :
Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)
For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awakened in a blocking sendmsg())
This patch adds two ways to set the limit :
1) Per socket option TCP_NOTSENT_LOWAT
2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->notsent_lowat
Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol)
Tested:
netperf sessions, and watching /proc/net/protocols "memory" column for TCP
With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.
A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local        Remote       Local  Elapsed  Throughput  Throughput  Local  Local   Remote  Remote  Local    Remote   Service
Send Socket  Recv Socket  Send   Time                 Units       CPU    CPU     CPU     CPU     Service  Service  Demand
Size         Size         Size   (sec)                            Util   Util    Util    Util    Demand   Demand   Units
Final        Final                                                %      Method  %       Method
1651584      6291456      16384  20.00    17447.90    10^6bits/s  3.13   S       -1.00   U       0.353    -1.000   usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
412,514 context-switches
200.034645535 seconds time elapsed
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local        Remote       Local  Elapsed  Throughput  Throughput  Local  Local   Remote  Remote  Local    Remote   Service
Send Socket  Recv Socket  Send   Time                 Units       CPU    CPU     CPU     CPU     Service  Service  Demand
Size         Size         Size   (sec)                            Util   Util    Util    Util    Demand   Demand   Units
Final        Final                                                %      Method  %       Method
1593240      6291456      16384  20.00    17321.16    10^6bits/s  3.35   S       -1.00   U       0.381    -1.000   usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
2,675,818 context-switches
200.029651391 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
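The POLLOUT rule described in this commit message can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel code: the helper names `effective_notsent_lowat` and `notsent_allows_pollout` are hypothetical, and only the unsent-bytes condition is modelled (send buffer space and other checks are ignored).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: choose the limit that applies to one socket.
 * Per the commit message, a per-socket value of 0 means the socket
 * does not use TCP_NOTSENT_LOWAT, so the sysctl value applies. */
static uint32_t effective_notsent_lowat(uint32_t sock_lowat,
					uint32_t sysctl_lowat)
{
	return sock_lowat ? sock_lowat : sysctl_lowat;
}

/* Hypothetical helper: would the unsent-bytes rule allow POLLOUT?
 * POLLOUT is reported only while unsent bytes are below the limit. */
static bool notsent_allows_pollout(uint32_t unsent_bytes,
				   uint32_t sock_lowat,
				   uint32_t sysctl_lowat)
{
	return unsent_bytes < effective_notsent_lowat(sock_lowat, sysctl_lowat);
}
```

With the default sysctl value of UINT_MAX the rule never blocks POLLOUT in this model, which matches the "no effect" default described above.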
2013-07-22 20:27:07 -07:00
. stream_memory_free = tcp_stream_memory_free ,
2005-04-16 15:20:36 -07:00
. sockets_allocated = & tcp_sockets_allocated ,
2005-08-09 20:11:41 -07:00
. orphan_count = & tcp_orphan_count ,
2022-06-08 23:34:08 -07:00
2005-04-16 15:20:36 -07:00
. memory_allocated = & tcp_memory_allocated ,
2022-06-08 23:34:08 -07:00
. per_cpu_fw_alloc = & tcp_memory_per_cpu_fw_alloc ,
2005-04-16 15:20:36 -07:00
. memory_pressure = & tcp_memory_pressure ,
2013-10-19 16:25:36 -07:00
. sysctl_mem = sysctl_tcp_mem ,
2017-11-07 00:29:28 -08:00
. sysctl_wmem_offset = offsetof ( struct net , ipv4 . sysctl_tcp_wmem ) ,
. sysctl_rmem_offset = offsetof ( struct net , ipv4 . sysctl_tcp_rmem ) ,
2005-04-16 15:20:36 -07:00
. max_header = MAX_TCP_HEADER ,
. obj_size = sizeof ( struct tcp_sock ) ,
2017-01-18 02:53:44 -08:00
. slab_flags = SLAB_TYPESAFE_BY_RCU ,
2005-12-13 23:25:19 -08:00
. twsk_prot = & tcp_timewait_sock_ops ,
2005-06-18 22:47:21 -07:00
. rsk_prot = & tcp_request_sock_ops ,
2022-09-07 18:10:19 -07:00
. h . hashinfo = NULL ,
2010-07-10 20:41:55 +00:00
. no_autobind = true ,
2015-12-16 12:30:05 +09:00
. diag_destroy = tcp_abort ,
2005-04-16 15:20:36 -07:00
} ;
2010-07-09 21:22:10 +00:00
EXPORT_SYMBOL ( tcp_prot ) ;
2005-04-16 15:20:36 -07:00
2015-01-29 21:35:05 -08:00
static void __net_exit tcp_sk_exit ( struct net * net )
{
2019-04-01 16:04:53 +08:00
if ( net - > ipv4 . tcp_congestion_control )
2020-01-08 16:35:08 -08:00
bpf_module_put ( net - > ipv4 . tcp_congestion_control ,
net - > ipv4 . tcp_congestion_control - > owner ) ;
2015-01-29 21:35:05 -08:00
}
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.
thash_entries  sockets  length  Gbps
       524288        1       1  50.7
                  24Mi      48  45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries  sockets  length  Gbps
       131072        1       1  50.7
                   1Mi       8  49.9
                   2Mi      16  48.9
                   4Mi      32  47.3
                   8Mi      64  44.6
                  16Mi     128  40.6
                  24Mi     192  36.3
                  32Mi     256  32.5
                  40Mi     320  27.0
                  48Mi     384  25.0
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
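The arithmetic behind the "length" column and the sign convention of net.ipv4.tcp_ehash_entries can be checked with a short user-space sketch. The names `avg_chain_length`, `struct ehash_view` and `parse_ehash_entries` are hypothetical and exist only for this illustration; the sketch assumes sockets are spread evenly over the buckets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: average sockets per bucket, assuming an even
 * spread. Integer division rounds down; 0 buckets yields 0. */
static uint64_t avg_chain_length(uint64_t sockets, uint64_t buckets)
{
	if (buckets == 0)
		return 0;
	return sockets / buckets;
}

/* Hypothetical view of a value read from net.ipv4.tcp_ehash_entries. */
struct ehash_view {
	bool shares_global;	/* negative value: netns uses the global ehash */
	uint64_t entries;	/* number of buckets, sign removed */
};

static struct ehash_view parse_ehash_entries(int64_t sysctl_value)
{
	struct ehash_view v;

	v.shares_global = sysctl_value < 0;
	v.entries = (uint64_t)(sysctl_value < 0 ? -sysctl_value : sysctl_value);
	return v;
}
```

For example, 24Mi sockets over 524288 buckets gives a length of 48, and 48Mi sockets over the 16Mi maximum gives 3, as quoted in the message above.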
2022-09-07 18:10:22 -07:00
static void __net_init tcp_set_hashinfo(struct net *net)
2008-04-03 14:31:33 -07:00
{
2022-09-07 18:10:22 -07:00
	struct inet_hashinfo *hinfo;
	unsigned int ehash_entries;
	struct net *old_net;

	if (net_eq(net, &init_net))
		goto fallback;

	old_net = current->nsproxy->net_ns;
	ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
	if (!ehash_entries)
		goto fallback;

	ehash_entries = roundup_pow_of_two(ehash_entries);
	hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
	if (!hinfo) {
		pr_warn("Failed to allocate TCP ehash (entries: %u) "
			"for a netns, fallback to the global one\n",
			ehash_entries);
fallback:
		hinfo = &tcp_hashinfo;
		ehash_entries = tcp_hashinfo.ehash_mask + 1;
	}

	net->ipv4.tcp_death_row.hashinfo = hinfo;
	net->ipv4.tcp_death_row.sysctl_max_tw_buckets = ehash_entries / 2;
	net->ipv4.sysctl_max_syn_backlog = max(128U, ehash_entries / 128);
}
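The sizing rules used by tcp_set_hashinfo() can be reproduced in user space to see which defaults a given tcp_child_ehash_entries value would produce. This is a sketch, not kernel code: `roundup_pow2_u32`, `struct ehash_defaults` and `derive_ehash_defaults` are hypothetical names, and the caller is assumed to pass a nonzero request (zero means "use the global ehash" in the function above and is not modelled here).

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's roundup_pow_of_two().
 * Assumes 1 <= n <= 2^31; the loop stops at 2^31 to avoid overflow. */
static uint32_t roundup_pow2_u32(uint32_t n)
{
	uint32_t p = 1;

	while (p < n && p < 0x80000000u)
		p <<= 1;
	return p;
}

/* Hypothetical container for the values derived from the ehash size. */
struct ehash_defaults {
	uint32_t entries;		/* bucket count after rounding up */
	uint32_t max_tw_buckets;	/* entries / 2 */
	uint32_t max_syn_backlog;	/* max(128, entries / 128) */
};

static struct ehash_defaults derive_ehash_defaults(uint32_t requested)
{
	struct ehash_defaults d;
	uint32_t backlog;

	d.entries = roundup_pow2_u32(requested);
	d.max_tw_buckets = d.entries / 2;
	backlog = d.entries / 128;
	d.max_syn_backlog = backlog > 128 ? backlog : 128;
	return d;
}
```

A request of 100 rounds up to 128 buckets, which matches the `net.ipv4.tcp_ehash_entries = 128` output shown in the commit message for the test2 netns.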
tcp: add rfc3168, section 6.1.1.1. fallback
This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested by the RFC, on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, only a small percentage of sites would see such additional
setup latency: it was found at the end of 2014 that ~56% of IPv4 and 65%
of IPv6 servers on the Alexa 1 million list would negotiate ECN (aka the
tcp_ecn=2 default), while 0.42% of these web servers fail to connect when
the client tries to negotiate ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the RFC3168,
section 6.1.1.1. fallback on timeout. For users who explicitly do not want
this, which can be the case in DC deployments, we add a
net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
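The client-side behaviour in the diagrams above can be modelled as a function of the retransmission count and the tcp_ecn_fallback knob. This is a simplified sketch and not the kernel implementation: `syn_flags_for_attempt` and the `FLAG_*` names are hypothetical, the client is assumed to run with tcp_ecn=1, and only the SYN, ECE and CWR bits of the TCP header flags are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard TCP header flag bits used by this sketch. */
enum {
	FLAG_SYN = 0x02,
	FLAG_ECE = 0x40,
	FLAG_CWR = 0x80,
};

/* Hypothetical helper: flags carried by the SYN for a given attempt.
 * retransmits == 0 is the initial ECN-setup SYN. On any retransmission,
 * ECE and CWR are cleared if the fallback knob is enabled. */
static uint8_t syn_flags_for_attempt(unsigned int retransmits,
				     bool ecn_fallback)
{
	uint8_t flags = FLAG_SYN | FLAG_ECE | FLAG_CWR;

	if (retransmits > 0 && ecn_fallback)
		flags &= (uint8_t)~(FLAG_ECE | FLAG_CWR);
	return flags;
}
```

With the fallback disabled, every retransmission keeps ECE and CWR set, which corresponds to the third diagram where the middlebox drops each SYN.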
2015-05-19 21:04:22 +02:00
2022-09-07 18:10:22 -07:00
static int __net_init tcp_sk_init ( struct net * net )
{
2013-01-05 16:10:48 +00:00
net - > ipv4 . sysctl_tcp_ecn = 2 ;
2015-05-19 21:04:22 +02:00
net - > ipv4 . sysctl_tcp_ecn_fallback = 1 ;
2015-02-10 09:53:16 +08:00
net - > ipv4 . sysctl_tcp_base_mss = TCP_BASE_MSS ;
2019-06-06 09:15:31 -07:00
net - > ipv4 . sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS ;
2015-03-06 11:18:23 +08:00
net - > ipv4 . sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD ;
2015-03-06 11:18:24 +08:00
net - > ipv4 . sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL ;
2019-08-07 19:52:29 -04:00
net - > ipv4 . sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS ;
2008-04-03 14:31:33 -07:00
2016-01-07 16:38:43 +02:00
net - > ipv4 . sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME ;
2016-01-07 16:38:44 +02:00
net - > ipv4 . sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES ;
2016-01-07 16:38:45 +02:00
net - > ipv4 . sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL ;
2016-01-07 16:38:43 +02:00
2016-02-03 09:46:49 +02:00
net - > ipv4 . sysctl_tcp_syn_retries = TCP_SYN_RETRIES ;
2016-02-03 09:46:50 +02:00
net - > ipv4 . sysctl_tcp_synack_retries = TCP_SYNACK_RETRIES ;
2016-02-08 04:24:33 -05:00
net - > ipv4 . sysctl_tcp_syncookies = 1 ;
2016-02-03 09:46:52 +02:00
net - > ipv4 . sysctl_tcp_reordering = TCP_FASTRETRANS_THRESH ;
2016-02-03 09:46:53 +02:00
net - > ipv4 . sysctl_tcp_retries1 = TCP_RETR1 ;
2016-02-03 09:46:54 +02:00
net - > ipv4 . sysctl_tcp_retries2 = TCP_RETR2 ;
2016-02-03 09:46:55 +02:00
net - > ipv4 . sysctl_tcp_orphan_retries = 0 ;
2016-02-03 09:46:56 +02:00
net - > ipv4 . sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT ;
2016-02-03 09:46:57 +02:00
net - > ipv4 . sysctl_tcp_notsent_lowat = UINT_MAX ;
2018-06-03 10:41:17 -07:00
net - > ipv4 . sysctl_tcp_tw_reuse = 2 ;
2019-12-09 14:19:59 -05:00
net - > ipv4 . sysctl_tcp_no_ssthresh_metrics_save = 1 ;
2016-02-03 09:46:51 +02:00
2022-09-07 18:10:18 -07:00
refcount_set ( & net - > ipv4 . tcp_death_row . tw_refcount , 1 ) ;
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.
thash_entries sockets length Gbps
524288 1 1 50.7
24Mi 48 45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries sockets length Gbps
131072 1 1 50.7
1Mi 8 49.9
2Mi 16 48.9
4Mi 32 47.3
8Mi 64 44.6
16Mi 128 40.6
24Mi 192 36.3
32Mi 256 32.5
40Mi 320 27.0
48Mi 384 25.0
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
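The two defaults listed above can be written out as a small sketch. The function names are illustrative only, and the input is assumed to be the effective ehash size of the new netns; only the two formulas come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Default tcp_max_tw_buckets for a given ehash size (formula from the text). */
uint32_t default_max_tw_buckets(uint32_t ehash_entries)
{
	return ehash_entries / 2;
}

/* Default tcp_max_syn_backlog: max(128, ehash_entries / 128). */
uint32_t default_max_syn_backlog(uint32_t ehash_entries)
{
	uint32_t scaled = ehash_entries / 128;

	return scaled > 128 ? scaled : 128;
}

/* Worked example for a small, 128-entry per-netns ehash. */
void example_small_ehash(void)
{
	assert(default_max_tw_buckets(128) == 64);
	assert(default_max_syn_backlog(128) == 128);	/* floor of 128 applies */
}
```

For a small table the TIME_WAIT limit becomes very low (64 for 128 entries), which is why the text recommends tuning these knobs carefully.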
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-07 18:10:22 -07:00
tcp_set_hashinfo ( net ) ;
2016-12-28 17:52:32 +08:00
2017-06-07 10:34:37 -07:00
net - > ipv4 . sysctl_tcp_sack = 1 ;
2017-06-07 10:34:38 -07:00
net - > ipv4 . sysctl_tcp_window_scaling = 1 ;
2017-06-07 10:34:39 -07:00
net - > ipv4 . sysctl_tcp_timestamps = 1 ;
2017-10-26 21:54:56 -07:00
net - > ipv4 . sysctl_tcp_early_retrans = 3 ;
2017-10-26 21:54:57 -07:00
net - > ipv4 . sysctl_tcp_recovery = TCP_RACK_LOSS_DETECTION ;
2017-10-26 21:54:59 -07:00
net - > ipv4 . sysctl_tcp_slow_start_after_idle = 1 ; /* By default, RFC2861 behavior. */
2017-10-26 21:55:00 -07:00
net - > ipv4 . sysctl_tcp_retrans_collapse = 1 ;
2017-10-26 21:55:06 -07:00
net - > ipv4 . sysctl_tcp_max_reordering = 300 ;
2017-10-26 21:55:07 -07:00
net - > ipv4 . sysctl_tcp_dsack = 1 ;
2017-10-26 21:55:08 -07:00
net - > ipv4 . sysctl_tcp_app_win = 31 ;
2017-10-26 21:55:09 -07:00
net - > ipv4 . sysctl_tcp_adv_win_scale = 1 ;
2017-10-26 21:55:10 -07:00
net - > ipv4 . sysctl_tcp_frto = 2 ;
2017-10-27 07:47:22 -07:00
net - > ipv4 . sysctl_tcp_moderate_rcvbuf = 1 ;
2017-10-27 07:47:23 -07:00
/* This limits the percentage of the congestion window which we
* will allow a single TSO frame to consume . Building TSO frames
* which are too large can cause TCP streams to be bursty .
*/
net - > ipv4 . sysctl_tcp_tso_win_divisor = 3 ;
2018-11-11 07:34:28 -08:00
/* Default TSQ limit of 16 TSO segments */
net - > ipv4 . sysctl_tcp_limit_output_bytes = 16 * 65536 ;
2022-08-30 11:56:56 -07:00
/* rfc5961 challenge ack rate limiting, per net-ns, disabled by default. */
net - > ipv4 . sysctl_tcp_challenge_ack_limit = INT_MAX ;
2017-10-27 07:47:27 -07:00
net - > ipv4 . sysctl_tcp_min_tso_segs = 2 ;
tcp: adjust TSO packet sizes based on min_rtt
Back when tcp_tso_autosize() and TCP pacing were introduced,
our focus was really to reduce burst sizes for long distance
flows.
The simple heuristic of using sk_pacing_rate/1024 has worked
well, but can lead to too small packets for hosts in the same
rack/cluster, when thousands of flows compete for the bottleneck.
Neal Cardwell had the idea of making the TSO burst size
a function of both sk_pacing_rate and tcp_min_rtt().
Indeed, for local flows, sending bigger bursts is better
to reduce cpu costs, as occasional losses can be repaired
quite fast.
This patch is based on Neal Cardwell's implementation
done more than two years ago.
bbr is adjusting max_pacing_rate based on measured bandwidth,
while cubic would overestimate max_pacing_rate.
/proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
this new feature, in logarithmic steps.
Tested:
100Gbit NIC, two hosts in the same rack, 4K MTU.
600 flows rate-limited to 20000000 bytes per second.
Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96005
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
65,945.29 msec task-clock # 2.845 CPUs utilized
1,314,632 context-switches # 19935.279 M/sec
5,292 cpu-migrations # 80.249 M/sec
940,641 page-faults # 14264.023 M/sec
201,117,030,926 cycles # 3049769.216 GHz (83.45%)
17,699,435,405 stalled-cycles-frontend # 8.80% frontend cycles idle (83.48%)
136,584,015,071 stalled-cycles-backend # 67.91% backend cycles idle (83.44%)
53,809,530,436 instructions # 0.27 insn per cycle
# 2.54 stalled cycles per insn (83.36%)
9,062,315,523 branches # 137422329.563 M/sec (83.22%)
153,008,621 branch-misses # 1.69% of all branches (83.32%)
23.182970846 seconds time elapsed
TcpInSegs 15648792 0.0
TcpOutSegs 58659110 0.0 # Average of 3.7 4K segments per TSO packet
TcpExtTCPDelivered 58654791 0.0
TcpExtTCPDeliveredCE 19 0.0
After patch:
~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96046
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
48,982.58 msec task-clock # 2.104 CPUs utilized
186,014 context-switches # 3797.599 M/sec
3,109 cpu-migrations # 63.472 M/sec
941,180 page-faults # 19214.814 M/sec
153,459,763,868 cycles # 3132982.807 GHz (83.56%)
12,069,861,356 stalled-cycles-frontend # 7.87% frontend cycles idle (83.32%)
120,485,917,953 stalled-cycles-backend # 78.51% backend cycles idle (83.24%)
36,803,672,106 instructions # 0.24 insn per cycle
# 3.27 stalled cycles per insn (83.18%)
5,947,266,275 branches # 121417383.427 M/sec (83.64%)
87,984,616 branch-misses # 1.48% of all branches (83.43%)
23.281200256 seconds time elapsed
TcpInSegs 1434706 0.0
TcpOutSegs 58883378 0.0 # Average of 41 4K segments per TSO packet
TcpExtTCPDelivered 58878971 0.0
TcpExtTCPDeliveredCE 9664 0.0
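The "before" figure quoted in the test setup (20000000/1024/4096 -> 4 segments) and the effect of tcp_tso_rtt_log can be explored with the simplified model below. It is a sketch under stated assumptions and not the kernel function: the RTT term (adding gso_max >> (min_rtt_us >> rtt_log)) is one way to realise "a function of both sk_pacing_rate and tcp_min_rtt()" in logarithmic steps, and the name `tso_segs_model` and the 65536-byte cap used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of TSO burst sizing (illustrative, not kernel code).
 * rate:       pacing rate in bytes per second
 * mss:        segment size in bytes
 * min_rtt_us: minimum RTT in microseconds
 * rtt_log:    value of tcp_tso_rtt_log
 * gso_max:    assumed cap on one TSO packet, in bytes
 * min_segs:   lower bound on the result (tcp_min_tso_segs)
 */
uint32_t tso_segs_model(uint64_t rate, uint32_t mss, uint32_t min_rtt_us,
			uint32_t rtt_log, uint32_t gso_max, uint32_t min_segs)
{
	uint64_t bytes = rate / 1024;	/* old heuristic: pacing rate / 1024 */
	uint32_t distance;
	uint32_t segs;

	assert(mss > 0 && rtt_log < 32);

	/* RTT distance in units of 2^rtt_log usec; nearby hosts give 0. */
	distance = min_rtt_us >> rtt_log;
	if (distance < 32)
		bytes += gso_max >> distance;	/* bonus halves per unit of distance */

	if (bytes > gso_max)
		bytes = gso_max;

	segs = (uint32_t)(bytes / mss);
	return segs > min_segs ? segs : min_segs;
}
```

Under these assumptions, a 20000000 bytes/s flow with a 4096-byte MSS and a 100 usec min RTT gets 4 segments with rtt_log = 0 and 16 segments (the assumed 65536-byte cap) with rtt_log = 9. The model does not attempt to reproduce the measured averages in the nstat output.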
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08 17:57:57 -08:00
net - > ipv4 . sysctl_tcp_tso_rtt_log = 9 ; /* 2^9 = 512 usec */
2017-10-27 07:47:28 -07:00
net - > ipv4 . sysctl_tcp_min_rtt_wlen = 300 ;
2017-10-27 07:47:29 -07:00
net - > ipv4 . sysctl_tcp_autocorking = 1 ;
2017-10-27 07:47:30 -07:00
net - > ipv4 . sysctl_tcp_invalid_ratelimit = HZ / 2 ;
2017-10-27 07:47:31 -07:00
net - > ipv4 . sysctl_tcp_pacing_ss_ratio = 200 ;
2017-10-27 07:47:32 -07:00
net - > ipv4 . sysctl_tcp_pacing_ca_ratio = 120 ;
2017-11-07 00:29:28 -08:00
if ( net ! = & init_net ) {
memcpy ( net - > ipv4 . sysctl_tcp_rmem ,
init_net . ipv4 . sysctl_tcp_rmem ,
sizeof ( init_net . ipv4 . sysctl_tcp_rmem ) ) ;
memcpy ( net - > ipv4 . sysctl_tcp_wmem ,
init_net . ipv4 . sysctl_tcp_wmem ,
sizeof ( init_net . ipv4 . sysctl_tcp_wmem ) ) ;
}
2018-05-17 14:47:28 -07:00
net - > ipv4 . sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC ;
2020-04-30 10:35:43 -07:00
net - > ipv4 . sysctl_tcp_comp_sack_slack_ns = 100 * NSEC_PER_USEC ;
2018-05-17 14:47:29 -07:00
net - > ipv4 . sysctl_tcp_comp_sack_nr = 44 ;
2017-09-27 11:35:40 +08:00
net - > ipv4 . sysctl_tcp_fastopen = TFO_CLIENT_ENABLE ;
2021-07-21 10:27:38 -07:00
net - > ipv4 . sysctl_tcp_fastopen_blackhole_timeout = 0 ;
2017-09-27 11:35:43 +08:00
atomic_set ( & net - > ipv4 . tfo_active_disable_times , 0 ) ;
2017-09-27 11:35:40 +08:00
2022-10-26 13:51:11 +00:00
/* Set default values for PLB */
net - > ipv4 . sysctl_tcp_plb_enabled = 0 ; /* Disabled by default */
net - > ipv4 . sysctl_tcp_plb_idle_rehash_rounds = 3 ;
net - > ipv4 . sysctl_tcp_plb_rehash_rounds = 12 ;
net - > ipv4 . sysctl_tcp_plb_suspend_rto_sec = 60 ;
/* Default congestion threshold for PLB to mark a round is 50% */
2022-10-26 13:51:12 +00:00
net - > ipv4 . sysctl_tcp_plb_cong_thresh = ( 1 < < TCP_PLB_SCALE ) / 2 ;
2022-10-26 13:51:11 +00:00
2017-11-14 08:25:49 -08:00
/* Reno is always built in */
if ( ! net_eq ( net , & init_net ) & &
2020-01-08 16:35:08 -08:00
bpf_try_module_get ( init_net . ipv4 . tcp_congestion_control ,
init_net . ipv4 . tcp_congestion_control - > owner ) )
2017-11-14 08:25:49 -08:00
net - > ipv4 . tcp_congestion_control = init_net . ipv4 . tcp_congestion_control ;
else
net - > ipv4 . tcp_congestion_control = & tcp_reno ;
tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.
We chose a default value for this sysctl of 4, to follow
the macOS and iOS backoff scheme of 1,1,1,1,1,2,4,8, ...
macOS and iOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts
This changes the SYN RTO scheme as follows: the RTO stays at init_rto_val
for the initial SYN and the first tcp_syn_linear_timeouts retransmissions,
then backs off exponentially, starting from init_rto_val.
For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
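The schedule described above can be expressed as a small function. This is a minimal sketch: `syn_rto` is a hypothetical helper, units are whatever init_rto is given in, and no upper clamp on the RTO is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Timeout used for the n-th SYN timer, where n = 0 is the initial SYN.
 * Illustrative helper, not kernel code. */
uint32_t syn_rto(uint32_t n, uint32_t init_rto, uint32_t linear_timeouts)
{
	uint32_t shift;

	if (n <= linear_timeouts)
		return init_rto;	/* linear phase: RTO stays constant */

	shift = n - linear_timeouts;
	assert(shift < 32);		/* keep the shift well defined */
	return init_rto << shift;	/* exponential phase: 2x, 4x, 8x, ... */
}
```

With init_rto = 1 and linear_timeouts = 2 this yields 1, 1, 1, 2, 4, 8, 16 for n = 0..6, and with linear_timeouts = 4 it yields the 1, 1, 1, 1, 1, 2, 4, 8 schedule mentioned for the default.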
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-09 18:05:58 +00:00
net - > ipv4 . sysctl_tcp_syn_linear_timeouts = 4 ;
tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.
To reproduce: Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()). This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].
As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.
The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of bytes ACKed back
to the sender. This problem has previously been identified in
RFC 7323, appendix F [1].
The Linux kernel currently adheres to never shrinking the window.
In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session. A receive buffer full condition should instead result
in a zero window and an indefinite wait.
In practice, this problem is largely hidden for most flows. It
is not applicable to mice flows. Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.
But this problem does show up for other types of flows. Examples
are websockets and other types of flows that send small amounts of
data spaced apart slightly in time. In these cases, we directly
encounter the problem described in [1].
RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window
management have made clear that the sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].
This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit set by
autotuning (sk_rcvbuf). This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window
Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323
Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-11 22:05:24 -05:00
net - > ipv4 . sysctl_tcp_shrink_window = 0 ;
tcp: add rfc3168, section 6.1.1.1. fallback
This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet on the first timeout, as suggested by the RFC:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, only a small percentage of sites would incur such additional
setup latency. At the end of 2014 it was found that ~56% of IPv4 and 65%
of IPv6 servers on the Alexa 1 million list would negotiate ECN (aka
tcp_ecn=2 default), while 0.42% of these webservers fail to connect when
trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. A recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users who explicitly do not want
this, which can be the case in DC deployments, we add a
net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 21:04:22 +02:00
return 0 ;
2009-12-03 02:29:09 +00:00
}
static void __net_exit tcp_sk_exit_batch ( struct list_head * net_exit_list )
{
2017-09-27 11:35:42 +08:00
struct net * net ;
2022-09-07 18:10:21 -07:00
tcp_twsk_purge ( net_exit_list , AF_INET ) ;
2022-05-12 14:14:56 -07:00
2022-09-07 18:10:18 -07:00
list_for_each_entry ( net , net_exit_list , exit_list ) {
2022-09-07 18:10:22 -07:00
inet_pernet_hashinfo_free ( net - > ipv4 . tcp_death_row . hashinfo ) ;
2022-09-07 18:10:18 -07:00
WARN_ON_ONCE ( ! refcount_dec_and_test ( & net - > ipv4 . tcp_death_row . tw_refcount ) ) ;
2017-09-27 11:35:42 +08:00
tcp_fastopen_ctx_destroy ( net ) ;
2022-09-07 18:10:18 -07:00
}
2008-04-03 14:31:33 -07:00
}
static struct pernet_operations __net_initdata tcp_sk_ops = {
2009-12-03 02:29:09 +00:00
. init = tcp_sk_init ,
. exit = tcp_sk_exit ,
. exit_batch = tcp_sk_exit_batch ,
2008-04-03 14:31:33 -07:00
} ;
2020-06-23 16:08:05 -07:00
# if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
DEFINE_BPF_ITER_FUNC ( tcp , struct bpf_iter_meta * meta ,
struct sock_common * sk_common , uid_t uid )
2021-07-01 13:06:13 -07:00
# define INIT_BATCH_SZ 16
2020-07-23 11:41:10 -07:00
static int bpf_iter_init_tcp ( void * priv_data , struct bpf_iter_aux_info * aux )
2020-06-23 16:08:05 -07:00
{
2021-07-01 13:06:13 -07:00
struct bpf_tcp_iter_state * iter = priv_data ;
int err ;
2020-06-23 16:08:05 -07:00
2021-07-01 13:06:13 -07:00
err = bpf_iter_init_seq_net ( priv_data , aux ) ;
if ( err )
return err ;
2020-06-23 16:08:05 -07:00
2021-07-01 13:06:13 -07:00
err = bpf_iter_tcp_realloc_batch ( iter , INIT_BATCH_SZ ) ;
if ( err ) {
bpf_iter_fini_seq_net ( priv_data ) ;
return err ;
}
return 0 ;
2020-06-23 16:08:05 -07:00
}
static void bpf_iter_fini_tcp ( void * priv_data )
{
2021-07-01 13:06:13 -07:00
struct bpf_tcp_iter_state * iter = priv_data ;
2020-06-23 16:08:05 -07:00
bpf_iter_fini_seq_net ( priv_data ) ;
2021-07-01 13:06:13 -07:00
kvfree ( iter - > batch ) ;
2020-06-23 16:08:05 -07:00
}
2020-07-23 11:41:09 -07:00
static const struct bpf_iter_seq_info tcp_seq_info = {
2020-06-23 16:08:05 -07:00
. seq_ops = & bpf_iter_tcp_seq_ops ,
. init_seq_private = bpf_iter_init_tcp ,
. fini_seq_private = bpf_iter_fini_tcp ,
2021-07-01 13:06:13 -07:00
. seq_priv_size = sizeof ( struct bpf_tcp_iter_state ) ,
2020-07-23 11:41:09 -07:00
} ;
2021-07-01 13:06:19 -07:00
static const struct bpf_func_proto *
bpf_iter_tcp_get_func_proto ( enum bpf_func_id func_id ,
const struct bpf_prog * prog )
{
switch ( func_id ) {
case BPF_FUNC_setsockopt :
return & bpf_sk_setsockopt_proto ;
case BPF_FUNC_getsockopt :
return & bpf_sk_getsockopt_proto ;
default :
return NULL ;
}
}
2020-07-23 11:41:09 -07:00
static struct bpf_iter_reg tcp_reg_info = {
. target = " tcp " ,
2020-06-23 16:08:05 -07:00
. ctx_arg_info_size = 1 ,
. ctx_arg_info = {
{ offsetof ( struct bpf_iter__tcp , sk_common ) ,
bpf: Add bpf_sock_destroy kfunc
The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends. The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated. The kfunc also allows terminating sockets that may
or may not be actively sending traffic.
The kfunc can currently be called only from BPF TCP and UDP iterators
where users can filter, and terminate selected sockets. More
specifically, it can only be called from BPF contexts that ensure
socket locking in order to allow synchronous execution of protocol
specific `diag_destroy` handlers. The previous commit that batches UDP
sockets during iteration facilitated a synchronous invocation of the UDP
destroy callback from BPF context by skipping socket locks in
`udp_abort`. TCP iterator already supported batching of sockets being
iterated. To that end, a `tracing_iter_filter` callback filter is added so
that the verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
attach type, and reject other programs.
The kfunc takes a `sock_common` type argument, even though it expects a
`sock` pointer and casts the argument to one. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. Furthermore, as `sock_common` only has a subset of
certain fields of `sock`, casting pointer to the latter type might not
always be safe for certain sockets like request sockets, but these have a
special handling in the diag_destroy handlers.
Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
e.g. getting a sk pointer (possibly even NULL) by following another sk
pointer. The pointer socket argument passed in TCP and UDP iterators is
tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes
are contributed by Martin KaFai Lau <martin.lau@kernel.org>.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 22:51:55 +00:00
PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED } ,
2020-06-23 16:08:05 -07:00
} ,
2021-07-01 13:06:19 -07:00
. get_func_proto = bpf_iter_tcp_get_func_proto ,
2020-07-23 11:41:09 -07:00
. seq_info = & tcp_seq_info ,
2020-06-23 16:08:05 -07:00
} ;
static void __init bpf_iter_register ( void )
{
2020-07-20 09:34:03 -07:00
tcp_reg_info . ctx_arg_info [ 0 ] . btf_id = btf_sock_ids [ BTF_SOCK_TYPE_SOCK_COMMON ] ;
2020-06-23 16:08:05 -07:00
if ( bpf_iter_reg_target ( & tcp_reg_info ) )
pr_warn ( " Warning: could not register bpf iterator tcp \n " ) ;
}
# endif
2008-02-29 11:13:15 -08:00
void __init tcp_v4_init ( void )
2005-04-16 15:20:36 -07:00
{
2022-01-24 12:24:57 -08:00
int cpu , res ;
for_each_possible_cpu ( cpu ) {
struct sock * sk ;
res = inet_ctl_sock_create ( & sk , PF_INET , SOCK_RAW ,
IPPROTO_TCP , & init_net ) ;
if ( res )
panic ( " Failed to create the TCP control socket. \n " ) ;
sock_set_flag ( sk , SOCK_USE_WRITE_QUEUE ) ;
/* Please enforce IP_DF and IPID==0 for RST and
* ACK sent in SYN - RECV and TIME - WAIT state .
*/
inet_sk ( sk ) - > pmtudisc = IP_PMTUDISC_DO ;
per_cpu ( ipv4_tcp_sk , cpu ) = sk ;
}
2009-02-22 00:10:18 -08:00
if ( register_pernet_subsys ( & tcp_sk_ops ) )
2005-04-16 15:20:36 -07:00
panic ( " Failed to create the TCP control socket. \n " ) ;
2020-06-23 16:08:05 -07:00
# if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
bpf_iter_register ( ) ;
# endif
2005-04-16 15:20:36 -07:00
}