// SPDX-License-Identifier: GPL-2.0-only
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Implementation of the Transmission Control Protocol (TCP).
 *
 * Authors:	Ross Biro
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Mark Evans, <evansmp@uhura.aston.ac.uk>
 *		Corey Minyard <wf-rch!minyard@relay.EU.net>
 *		Florian La Roche, <flla@stud.uni-sb.de>
 *		Charles Hedrick, <hedrick@klinzhai.rutgers.edu>
 *		Linus Torvalds, <torvalds@cs.helsinki.fi>
 *		Alan Cox, <gw4pts@gw4pts.ampr.org>
 *		Matthew Dillon, <dillon@apollo.west.oic.com>
 *		Arnt Gulbrandsen, <agulbra@nvg.unit.no>
 *		Jorge Cwik, <jorge@laser.satlink.net>
 */

#include <linux/module.h>
#include <linux/gfp.h>
#include <net/tcp.h>

static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	u32 elapsed, start_ts, user_timeout;
	s32 remaining;

	start_ts = tcp_sk(sk)->retrans_stamp;
	user_timeout = READ_ONCE(icsk->icsk_user_timeout);
	if (!user_timeout)
		return icsk->icsk_rto;
	elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
	remaining = user_timeout - elapsed;
	if (remaining <= 0)
		return 1; /* user timeout has passed; fire ASAP */

	return min_t(u32, icsk->icsk_rto, msecs_to_jiffies(remaining));
}
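
/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the clamping rule applied by tcp_clamp_rto_to_user_timeout(),
 * assuming every quantity is already expressed in milliseconds. The name
 * clamp_rto_ms() and its parameters are hypothetical; the kernel function
 * works on jiffies and returns 1 jiffy once the user timeout has elapsed.
 */

```c
#include <stdint.h>

/* Hypothetical helper: next timer delay in ms, mirroring the logic above. */
static uint32_t clamp_rto_ms(uint32_t rto_ms, uint32_t user_timeout_ms,
			     uint32_t elapsed_ms)
{
	int32_t remaining;

	if (!user_timeout_ms)		/* TCP_USER_TIMEOUT not set */
		return rto_ms;

	/* Signed difference, so an overshoot shows up as a negative value. */
	remaining = (int32_t)(user_timeout_ms - elapsed_ms);
	if (remaining <= 0)
		return 1;		/* already expired: fire as soon as possible */

	return rto_ms < (uint32_t)remaining ? rto_ms : (uint32_t)remaining;
}
```

/*
 * With a 5000 ms user timeout and 4000 ms already elapsed, a 3000 ms RTO
 * would be shortened to the 1000 ms that remain.
 */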

u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	u32 remaining, user_timeout;
	s32 elapsed;

	user_timeout = READ_ONCE(icsk->icsk_user_timeout);
	if (!user_timeout || !icsk->icsk_probes_tstamp)
		return when;

	elapsed = tcp_jiffies32 - icsk->icsk_probes_tstamp;
	if (unlikely(elapsed < 0))
		elapsed = 0;
	remaining = msecs_to_jiffies(user_timeout) - elapsed;
	remaining = max_t(u32, remaining, TCP_TIMEOUT_MIN);

	return min_t(u32, remaining, when);
}

/**
 *  tcp_write_err() - close socket and save error info
 *  @sk:  The socket the error has appeared on.
 *
 *  Returns: Nothing (void)
 */
static void tcp_write_err(struct sock *sk)
{
	WRITE_ONCE(sk->sk_err, READ_ONCE(sk->sk_err_soft) ? : ETIMEDOUT);
	sk_error_report(sk);

	tcp_write_queue_purge(sk);
	tcp_done(sk);
	__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONTIMEOUT);
}

/**
 *  tcp_out_of_resources() - Close socket if out of resources
 *  @sk:        pointer to current socket
 *  @do_reset:  send a last packet with reset flag
 *
 *  Do not allow orphaned sockets to eat all our resources.
 *  This is a direct violation of the TCP specs, but it is required
 *  to prevent DoS attacks. It is called when a retransmission timeout
 *  or zero probe timeout occurs on an orphaned socket.
 *
 *  Also close if our net namespace is exiting; in that case there is no
 *  hope of ever communicating again since all netns interfaces are already
 *  down (or about to be down), and we need to release our dst references,
 *  which have been moved to the netns loopback interface, so the namespace
 *  can finish exiting.  This condition is only possible if we are a kernel
 *  socket, as those do not hold references to the namespace.
 *
 *  The criteria are still not confirmed experimentally and may change.
 *  We kill the socket if:
 *  1. If the number of orphaned sockets exceeds an administratively
 *     configured limit.
 *  2. If we have strong memory pressure.
 *  3. If our net namespace is exiting.
 */
static int tcp_out_of_resources(struct sock *sk, bool do_reset)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int shift = 0;

	/* If peer does not open window for long time, or did not transmit
	 * anything for long time, penalize it. */
	if ((s32)(tcp_jiffies32 - tp->lsndtime) > 2 * TCP_RTO_MAX || !do_reset)
		shift++;

	/* If some dubious ICMP arrived, penalize even more. */
	if (READ_ONCE(sk->sk_err_soft))
		shift++;

	if (tcp_check_oom(sk, shift)) {
		/* Catch exceptional cases, when connection requires reset.
		 *      1. Last segment was sent recently. */
		if ((s32)(tcp_jiffies32 - tp->lsndtime) <= TCP_TIMEWAIT_LEN ||
		    /*  2. Window is closed. */
		    (!tp->snd_wnd && !tp->packets_out))
			do_reset = true;
		if (do_reset)
			tcp_send_active_reset(sk, GFP_ATOMIC);
		tcp_done(sk);
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
		return 1;
	}

	if (!check_net(sock_net(sk))) {
		/* Not possible to send reset; just close */
		tcp_done(sk);
		return 1;
	}

	return 0;
}

/**
 *  tcp_orphan_retries() - Returns maximal number of retries on an orphaned socket
 *  @sk:    Pointer to the current socket.
 *  @alive: bool, socket alive state
 */
static int tcp_orphan_retries(struct sock *sk, bool alive)
{
	int retries = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_orphan_retries); /* May be zero. */

	/* We know from an ICMP that something is wrong. */
	if (READ_ONCE(sk->sk_err_soft) && !alive)
		retries = 0;

	/* However, if socket sent something recently, select some safe
	 * number of retries. 8 corresponds to >100 seconds with minimal
	 * RTO of 200 msec. */
	if (retries == 0 && alive)
		retries = 8;
	return retries;
}
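
/*
 * Illustration only, not part of the kernel source: a small userspace
 * sketch that checks the "8 corresponds to >100 seconds" remark in
 * tcp_orphan_retries() by summing an exponentially backed-off timer
 * schedule. It assumes a minimal RTO of 200 ms (as stated in that comment)
 * and a maximal RTO of 120 s; the SKETCH_* constants and
 * backoff_total_ms() are hypothetical names introduced for this example.
 */

```c
#include <stdint.h>

#define SKETCH_RTO_MIN_MS	200u	/* assumed minimal RTO */
#define SKETCH_RTO_MAX_MS	120000u	/* assumed maximal RTO */

/*
 * Total time in ms covered by the initial retransmission timer plus
 * `retries` further timers, doubling each time and capped at the maximum.
 */
static uint64_t backoff_total_ms(unsigned int retries)
{
	uint64_t total = 0;
	uint64_t rto = SKETCH_RTO_MIN_MS;
	unsigned int i;

	for (i = 0; i <= retries; i++) {
		total += rto;
		rto *= 2;
		if (rto > SKETCH_RTO_MAX_MS)
			rto = SKETCH_RTO_MAX_MS;
	}
	return total;
}
```

/*
 * Under these assumptions backoff_total_ms(8) is 102200 ms, i.e. just over
 * 100 seconds.
 */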

static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
{
	const struct net *net = sock_net(sk);
	int mss;

	/* Black hole detection */
	if (!READ_ONCE(net->ipv4.sysctl_tcp_mtu_probing))
		return;

	if (!icsk->icsk_mtup.enabled) {
		icsk->icsk_mtup.enabled = 1;
		icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
	} else {
		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
		mss = min(READ_ONCE(net->ipv4.sysctl_tcp_base_mss), mss);
		mss = max(mss, READ_ONCE(net->ipv4.sysctl_tcp_mtu_probe_floor));
		mss = max(mss, READ_ONCE(net->ipv4.sysctl_tcp_min_snd_mss));
		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
	}
	tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
}

static unsigned int tcp_model_timeout(struct sock *sk,
				      unsigned int boundary,
				      unsigned int rto_base)
{
	unsigned int linear_backoff_thresh, timeout;

	linear_backoff_thresh = ilog2(TCP_RTO_MAX / rto_base);
	if (boundary <= linear_backoff_thresh)
		timeout = ((2 << boundary) - 1) * rto_base;
	else
		timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +
			(boundary - linear_backoff_thresh) * TCP_RTO_MAX;
	return jiffies_to_msecs(timeout);
}
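
/*
 * Illustration only, not part of the kernel source: a userspace sketch of
 * the closed-form model in tcp_model_timeout(), assuming all values are in
 * milliseconds rather than jiffies. sketch_ilog2() and model_timeout_ms()
 * are hypothetical stand-ins for ilog2() and the kernel function; the RTO
 * bounds are passed in as parameters instead of using TCP_RTO_MAX.
 */

```c
#include <stdint.h>

/* floor(log2(v)) for v > 0, standing in for the kernel's ilog2(). */
static unsigned int sketch_ilog2(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Modeled timeout in ms after `boundary` backed-off retransmissions. */
static uint32_t model_timeout_ms(unsigned int boundary, uint32_t rto_base_ms,
				 uint32_t rto_max_ms)
{
	unsigned int thresh = sketch_ilog2(rto_max_ms / rto_base_ms);

	/* Purely exponential phase: rto_base * (2^(boundary + 1) - 1). */
	if (boundary <= thresh)
		return ((2u << boundary) - 1) * rto_base_ms;

	/* Past the threshold every further retry adds one capped RTO. */
	return ((2u << thresh) - 1) * rto_base_ms +
	       (boundary - thresh) * rto_max_ms;
}
```

/*
 * For example, with a 200 ms base RTO and a 120 s cap, a boundary of 15
 * gives 1023 * 200 + 6 * 120000 = 924600 ms.
 */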

/**
 *  retransmits_timed_out() - returns true if this connection has timed out
 *  @sk:       The current socket
 *  @boundary: max number of retransmissions
 *  @timeout:  A custom timeout value.
 *             If set to 0 the default timeout is calculated and used.
 *             Using TCP_RTO_MIN and the number of unsuccessful retransmits.
 *
 * The default "timeout" value this function can calculate and use
 * is equivalent to the timeout of a TCP Connection
 * after "boundary" unsuccessful, exponentially backed-off
 * retransmissions with an initial RTO of TCP_RTO_MIN.
 */
static bool retransmits_timed_out(struct sock *sk,
				  unsigned int boundary,
				  unsigned int timeout)
{
	unsigned int start_ts;

	if (!inet_csk(sk)->icsk_retransmits)
		return false;

	start_ts = tcp_sk(sk)->retrans_stamp;
	if (likely(timeout == 0)) {
		unsigned int rto_base = TCP_RTO_MIN;

		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
			rto_base = tcp_timeout_init(sk);
		timeout = tcp_model_timeout(sk, boundary, rto_base);
	}

	return (s32)(tcp_time_stamp(tcp_sk(sk)) - start_ts - timeout) >= 0;
}

/* A write timeout has occurred. Process the after effects. */
static int tcp_write_timeout(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	struct net *net = sock_net(sk);
	bool expired = false, do_reset;
	int retry_until, max_retransmits;

	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
		if (icsk->icsk_retransmits)
			__dst_negative_advice(sk);
		/* Paired with WRITE_ONCE() in tcp_sock_set_syncnt() */
		retry_until = READ_ONCE(icsk->icsk_syn_retries) ? :
			READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);

		max_retransmits = retry_until;
		if (sk->sk_state == TCP_SYN_SENT)
			max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts);

		expired = icsk->icsk_retransmits >= max_retransmits;
	} else {
		if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) {
			/* Black hole detection */
			tcp_mtu_probing(icsk, sk);

			__dst_negative_advice(sk);
		}

		retry_until = READ_ONCE(net->ipv4.sysctl_tcp_retries2);
		if (sock_flag(sk, SOCK_DEAD)) {
			const bool alive = icsk->icsk_rto < TCP_RTO_MAX;

			retry_until = tcp_orphan_retries(sk, alive);
			do_reset = alive ||
				!retransmits_timed_out(sk, retry_until, 0);

			if (tcp_out_of_resources(sk, do_reset))
				return 1;
		}
	}
	if (!expired)
		expired = retransmits_timed_out(sk, retry_until,
						READ_ONCE(icsk->icsk_user_timeout));
	tcp_fastopen_active_detect_blackhole(sk, expired);

	if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
		tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
				  icsk->icsk_retransmits,
				  icsk->icsk_rto, (int)expired);

	if (expired) {
		/* Has it gone just too far? */
		tcp_write_err(sk);
		return 1;
	}

	if (sk_rethink_txhash(sk)) {
		tp->timeout_rehash++;
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
	}

	return 0;
}

/* Called with BH disabled */
void tcp_delack_timer_handler(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
		return;

	/* Handling the sack compression case */
	if (tp->compressed_ack) {
		tcp_mstamp_refresh(tp);
		tcp_sack_compress_send_ack(sk);
		return;
	}

	if (!(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
		return;

	if (time_after(icsk->icsk_ack.timeout, jiffies)) {
		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
		return;
	}
	icsk->icsk_ack.pending &= ~ICSK_ACK_TIMER;

	if (inet_csk_ack_scheduled(sk)) {
		if (!inet_csk_in_pingpong_mode(sk)) {
			/* Delayed ACK missed: inflate ATO. */
			icsk->icsk_ack.ato = min(icsk->icsk_ack.ato << 1, icsk->icsk_rto);
		} else {
			/* Delayed ACK missed: leave pingpong mode and
			 * deflate ATO.
			 */
			inet_csk_exit_pingpong_mode(sk);
			icsk->icsk_ack.ato = TCP_ATO_MIN;
		}
		tcp_mstamp_refresh(tp);
		tcp_send_ack(sk);
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKS);
	}
}
2016-07-16 04:04:34 +02:00
/**
* tcp_delack_timer ( ) - The TCP delayed ACK timeout handler
2020-07-13 01:15:02 +02:00
* @ t : Pointer to the timer . ( gets cast to struct sock * )
2016-07-16 04:04:34 +02:00
*
* This function gets ( indirectly ) called when the kernel timer for a TCP packet
* of this socket expires . Calls tcp_delack_timer_handler ( ) to do the actual work .
*
* Returns : Nothing ( void )
*/
2017-10-16 17:29:19 -07:00
static void tcp_delack_timer ( struct timer_list * t )
2012-07-20 05:45:50 +00:00
{
2017-10-16 17:29:19 -07:00
struct inet_connection_sock * icsk =
from_timer ( icsk , t , icsk_delack_timer ) ;
struct sock * sk = & icsk - > icsk_inet . sk ;
2012-07-20 05:45:50 +00:00
bh_lock_sock ( sk ) ;
if ( ! sock_owned_by_user ( sk ) ) {
tcp_delack_timer_handler ( sk ) ;
} else {
2016-04-27 16:44:39 -07:00
__NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_DELAYEDACKLOCKED ) ;
2012-07-20 05:45:50 +00:00
/* delegate our work to tcp_release_cb() */
2016-12-03 11:14:57 -08:00
if ( ! test_and_set_bit ( TCP_DELACK_TIMER_DEFERRED , & sk - > sk_tsq_flags ) )
tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added bug leading to following trace :
[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]
The bug comes from the fact that timer set in sk_reset_timer() can run
before we actually do the sock_hold(). socket refcount reaches zero and
we free the socket too soon.
timer handler is not allowed to reduce socket refcnt if socket is owned
by the user, or we need to change sk_reset_timer() implementation.
We should take a reference on the socket in case TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags
Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.
For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 00:22:46 +00:00
sock_hold ( sk ) ;
2012-07-20 05:45:50 +00:00
}
2005-04-16 15:20:36 -07:00
bh_unlock_sock ( sk ) ;
sock_put ( sk ) ;
}
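The deferral pattern used by tcp_delack_timer() (set a flag atomically, and take an extra reference only if this caller was the one who set it) can be sketched with C11 atomics. Everything here is an assumption made for illustration: `struct sketch_sock`, `SKETCH_DELACK_DEFERRED` and `sketch_defer_delack` are hypothetical and do not model the real socket locking.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature socket: a flag word and a reference count. */
struct sketch_sock {
	atomic_ulong flags;
	atomic_int refcnt;
};

#define SKETCH_DELACK_DEFERRED 0x1ul

/*
 * Returns true if this call newly marked the work as deferred.
 * Only the caller that flips the bit from 0 to 1 takes a reference,
 * so one pending deferral pins the socket exactly once.
 */
static bool sketch_defer_delack(struct sketch_sock *sk)
{
	unsigned long old = atomic_fetch_or(&sk->flags, SKETCH_DELACK_DEFERRED);

	if (old & SKETCH_DELACK_DEFERRED)
		return false;	/* already deferred, reference already held */

	atomic_fetch_add(&sk->refcnt, 1);
	return true;
}
```

Tying the reference to the 0-to-1 transition means whoever later clears the flag and runs the deferred work can drop exactly one reference.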
static void tcp_probe_timer ( struct sock * sk )
{
2005-08-10 04:03:31 -03:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2017-10-05 22:21:27 -07:00
struct sk_buff * skb = tcp_send_head ( sk ) ;
2005-04-16 15:20:36 -07:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
int max_probes ;
2017-10-05 22:21:27 -07:00
if ( tp - > packets_out | | ! skb ) {
2005-08-10 04:03:31 -03:00
icsk - > icsk_probes_out = 0 ;
2021-01-15 14:30:58 -08:00
icsk - > icsk_probes_tstamp = 0 ;
2005-04-16 15:20:36 -07:00
return ;
}
2014-09-29 13:20:38 -07:00
/* RFC 1122 4.2.2.17 requires the sender to stay open indefinitely as
* long as the receiver continues to respond to probes . We support this by
* default and reset icsk_probes_out with incoming ACKs . But if the
* socket is orphaned or the user specifies TCP_USER_TIMEOUT , we
* kill the socket when the retry count and the time exceed the
* corresponding system limit . We also implement a similar policy when
* we use RTO to probe window in tcp_retransmit_timer ( ) .
2005-04-16 15:20:36 -07:00
*/
2023-08-04 14:46:12 +00:00
if ( ! icsk - > icsk_probes_tstamp ) {
2021-01-15 14:30:58 -08:00
icsk - > icsk_probes_tstamp = tcp_jiffies32 ;
2023-08-04 14:46:12 +00:00
} else {
u32 user_timeout = READ_ONCE ( icsk - > icsk_user_timeout ) ;
2005-04-16 15:20:36 -07:00
2023-08-04 14:46:12 +00:00
if ( user_timeout & &
( s32 ) ( tcp_jiffies32 - icsk - > icsk_probes_tstamp ) > =
msecs_to_jiffies ( user_timeout ) )
goto abort ;
}
2022-07-15 10:17:50 -07:00
max_probes = READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_tcp_retries2 ) ;
2005-04-16 15:20:36 -07:00
if ( sock_flag ( sk , SOCK_DEAD ) ) {
2015-10-09 02:41:37 +02:00
const bool alive = inet_csk_rto_backoff ( icsk , TCP_RTO_MAX ) < TCP_RTO_MAX ;
2007-02-09 23:24:47 +09:00
2005-04-16 15:20:36 -07:00
max_probes = tcp_orphan_retries ( sk , alive ) ;
2014-09-29 13:20:38 -07:00
if ( ! alive & & icsk - > icsk_backoff > = max_probes )
goto abort ;
if ( tcp_out_of_resources ( sk , true ) )
2005-04-16 15:20:36 -07:00
return ;
}
2018-11-28 16:06:43 -08:00
if ( icsk - > icsk_probes_out > = max_probes ) {
2014-09-29 13:20:38 -07:00
abort : tcp_write_err ( sk ) ;
2005-04-16 15:20:36 -07:00
} else {
/* Only send another probe if we didn't close things up. */
tcp_send_probe0 ( sk ) ;
}
}
2012-08-31 12:29:12 +00:00
/*
* Timer for Fast Open socket to retransmit SYNACK . Note that the
* sk here is the child socket , not the parent ( listener ) socket .
*/
2019-10-10 20:17:38 -07:00
static void tcp_fastopen_synack_timer ( struct sock * sk , struct request_sock * req )
2012-08-31 12:29:12 +00:00
{
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2019-01-16 15:05:31 -08:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
2022-07-15 10:17:46 -07:00
int max_retries ;
2012-08-31 12:29:12 +00:00
2015-03-22 10:22:19 -07:00
req - > rsk_ops - > syn_ack_timeout ( req ) ;
2012-08-31 12:29:12 +00:00
2023-08-04 14:46:11 +00:00
/* Add one more retry for fastopen.
* Paired with WRITE_ONCE ( ) in tcp_sock_set_syncnt ( )
*/
max_retries = READ_ONCE ( icsk - > icsk_syn_retries ) ? :
2022-07-15 10:17:46 -07:00
READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_tcp_synack_retries ) + 1 ;
2012-10-27 23:16:46 +00:00
if ( req - > num_timeout > = max_retries ) {
2012-08-31 12:29:12 +00:00
tcp_write_err ( sk ) ;
return ;
}
2019-04-29 15:46:17 -07:00
/* Lower cwnd after certain SYNACK timeout like tcp_init_transfer() */
if ( icsk - > icsk_retransmits = = 1 )
tcp_enter_loss ( sk ) ;
2012-08-31 12:29:12 +00:00
/* XXX (TFO) - Unlike regular SYN-ACK retransmit, we ignore error
* returned from rtx_syn_ack ( ) to make it more persistent like
* regular retransmit because if the child socket has been accepted
* it ' s not good to give up too easily .
*/
2012-10-27 23:16:46 +00:00
inet_rtx_syn_ack ( sk , req ) ;
req - > num_timeout + + ;
2016-09-21 16:16:15 -07:00
icsk - > icsk_retransmits + + ;
2019-01-16 15:05:31 -08:00
if ( ! tp - > retrans_stamp )
tp - > retrans_stamp = tcp_time_stamp ( tp ) ;
2012-08-31 12:29:12 +00:00
inet_csk_reset_xmit_timer ( sk , ICSK_TIME_RETRANS ,
tcp: Make SYN ACK RTO tunable by BPF programs with TFO
Instead of the hardcoded TCP_TIMEOUT_INIT, this diff calls tcp_timeout_init
to initiate req->timeout like the non TFO SYN ACK case.
Tested using the following packetdrill script, on a host with a BPF
program that sets the initial connect timeout to 10ms.
`../../common/defaults.sh`
// Initialize connection
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_TCP, TCP_FASTOPEN, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,FO TFO_COOKIE>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.02 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.04 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 < . 1:1(0) ack 1 win 32792
+0 accept(3, ..., ...) = 4
Signed-off-by: Jie Meng <jmeng@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-15 13:29:00 -07:00
req - > timeout < < req - > num_timeout , TCP_RTO_MAX ) ;
2012-08-31 12:29:12 +00:00
}
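The timer value passed at the end of tcp_fastopen_synack_timer() is the request timeout shifted left by the number of timeouts so far, bounded by a maximum. A minimal sketch of that arithmetic follows; `sketch_synack_timeout` is a hypothetical helper, and the units are whatever the caller uses (milliseconds in the checks that accompany it).

```c
/*
 * Exponential SYN-ACK retransmit timeout: base << num_timeout,
 * clamped to max. Assumes num_timeout is small enough that the
 * shift does not overflow unsigned long.
 */
static unsigned long sketch_synack_timeout(unsigned long base,
					   unsigned int num_timeout,
					   unsigned long max)
{
	unsigned long when = base << num_timeout;

	return when > max ? max : when;
}
```

Because num_timeout is incremented before the timer is re-armed, a 10 ms base gives 20 ms and then 40 ms for the following retransmits, which is consistent with the +.01, +.02, +.04 spacing in the packetdrill script quoted above.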
net: tcp: fix unexcepted socket die when snd_wnd is 0
In tcp_retransmit_timer(), a window shrunk connection will be regarded
as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
right all the time.
The retransmits will become zero-window probes in tcp_retransmit_timer()
if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
TCP_RTO_MAX sooner or later.
However, the timer can be delayed and be triggered after 122877ms, not
TCP_RTO_MAX, as I tested.
Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO come up to TCP_RTO_MAX, and the socket will die.
Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exact the timestamp of the timeout.
However, "tp->rcv_tstamp" can restart from idle, then tp->rcv_tstamp
could already be a long time (minutes or hours) in the past even on the
first RTO. So we double check the timeout with the duration of the
retransmission.
Meanwhile, making "2 * TCP_RTO_MAX" as the timeout to avoid the socket
dying too soon.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-11 10:55:29 +08:00
static bool tcp_rtx_probe0_timed_out ( const struct sock * sk ,
const struct sk_buff * skb )
{
const struct tcp_sock * tp = tcp_sk ( sk ) ;
const int timeout = TCP_RTO_MAX * 2 ;
u32 rcv_delta , rtx_delta ;
rcv_delta = inet_csk ( sk ) - > icsk_timeout - tp - > rcv_tstamp ;
if ( rcv_delta < = timeout )
return false ;
rtx_delta = ( u32 ) msecs_to_jiffies ( tcp_time_stamp ( tp ) -
( tp - > retrans_stamp ? : tcp_skb_timestamp ( skb ) ) ) ;
return rtx_delta > timeout ;
}
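The two-stage check in tcp_rtx_probe0_timed_out() relies on unsigned 32-bit subtraction, which stays correct when the tick counter wraps. The sketch below restates it under simplifying assumptions: all timestamps share one unit (milliseconds), `SKETCH_RTO_MAX_MS` assumes the 120 second maximum, and the function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: maximum RTO of 120 seconds, expressed in milliseconds. */
#define SKETCH_RTO_MAX_MS 120000u

/*
 * Stage 1: time between the last received ACK and the timer deadline.
 * Stage 2: only if stage 1 exceeds the limit, also require that the
 *          retransmission itself has been going on for longer than the limit.
 */
static bool sketch_probe0_timed_out(uint32_t timer_deadline, uint32_t last_rcv,
				    uint32_t now, uint32_t first_rtx)
{
	const uint32_t timeout = 2u * SKETCH_RTO_MAX_MS;
	uint32_t rcv_delta = timer_deadline - last_rcv;	/* modulo 2^32 */

	if (rcv_delta <= timeout)
		return false;

	return (uint32_t)(now - first_rtx) > timeout;
}
```

The second stage covers the case described in the commit message where the last-receive timestamp is already old on the first RTO because the connection restarted from idle.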
2005-04-16 15:20:36 -07:00
2016-07-16 04:04:34 +02:00
/**
* tcp_retransmit_timer ( ) - The TCP retransmit timeout handler
* @ sk : Pointer to the current socket .
*
* This function gets called when the kernel timer for a TCP packet
* of this socket expires .
*
2021-06-07 23:01:09 +08:00
* It handles retransmission , timer adjustment and other necessary measures .
2016-07-16 04:04:34 +02:00
*
* Returns : Nothing ( void )
*/
Revert Backoff [v3]: Revert RTO on ICMP destination unreachable
Here, an ICMP host/network unreachable message, whose payload fits to
TCP's SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path.
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.
This patch reverts one RTO backoff, if an ICMP host/network unreachable
message, whose payload fits to TCP's SND.UNA, arrives.
Based on the new RTO, the retransmission timer is reset to reflect the
remaining time, or - if the revert clocked out the timer - a retransmission
is sent out immediately.
Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
there have been retransmissions and reversible backoffs, already.
Changes from v2:
1) Renaming of skb in tcp_v4_err() moved to another patch.
2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
3) Fixed code comments.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-26 00:16:31 +00:00
void tcp_retransmit_timer ( struct sock * sk )
2005-04-16 15:20:36 -07:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
2016-02-03 09:46:53 +02:00
struct net * net = sock_net ( sk ) ;
2005-08-09 20:10:42 -07:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2019-10-10 20:17:38 -07:00
struct request_sock * req ;
2019-12-03 08:05:52 -08:00
struct sk_buff * skb ;
2005-04-16 15:20:36 -07:00
2019-10-10 20:17:38 -07:00
req = rcu_dereference_protected ( tp - > fastopen_rsk ,
lockdep_sock_is_held ( sk ) ) ;
if ( req ) {
2012-10-22 11:26:36 +00:00
WARN_ON_ONCE ( sk - > sk_state ! = TCP_SYN_RECV & &
sk - > sk_state ! = TCP_FIN_WAIT1 ) ;
2019-10-10 20:17:38 -07:00
tcp_fastopen_synack_timer ( sk , req ) ;
2012-08-31 12:29:12 +00:00
/* Before we receive ACK to our SYN-ACK don't retransmit
* anything else ( e . g . , data or FIN segments ) .
*/
return ;
}
2019-12-03 08:05:52 -08:00
if ( ! tp - > packets_out )
return ;
skb = tcp_rtx_queue_head ( sk ) ;
if ( WARN_ON_ONCE ( ! skb ) )
2019-01-16 15:05:28 -08:00
return ;
2005-04-16 15:20:36 -07:00
2013-03-11 10:00:44 +00:00
tp - > tlp_high_seq = 0 ;
2005-04-16 15:20:36 -07:00
if ( ! tp - > snd_wnd & & ! sock_flag ( sk , SOCK_DEAD ) & &
! ( ( 1 < < sk - > sk_state ) & ( TCPF_SYN_SENT | TCPF_SYN_RECV ) ) ) {
/* Receiver dastardly shrinks window. Our retransmits
* become zero probes , but we should not timeout this
* connection . If the socket is an orphan , time it out ,
* we cannot allow such beasts to hang infinitely .
*/
2008-04-14 04:09:36 -07:00
struct inet_sock * inet = inet_sk ( sk ) ;
net: tcp: refactor the dbg message in tcp_retransmit_timer()
The debug message in tcp_retransmit_timer() is slightly wrong, because
they could be printed even if we did not receive a new ACK packet from
the remote peer.
Change it to probing zero-window, as it is a expected case now. The
description may be not correct.
Adding the duration since the last ACK we received, and the duration of
the retransmission, which are useful for debugging.
And the message now like this:
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-11 10:55:30 +08:00
u32 rtx_delta ;
rtx_delta = tcp_time_stamp ( tp ) - ( tp - > retrans_stamp ? : tcp_skb_timestamp ( skb ) ) ;
2008-04-14 04:09:36 -07:00
if ( sk - > sk_family = = AF_INET ) {
2023-08-11 10:55:30 +08:00
net_dbg_ratelimited ( " Probing zero-window on %pI4:%u/%u, seq=%u:%u, recv %ums ago, lasting %ums \n " ,
& inet - > inet_daddr , ntohs ( inet - > inet_dport ) ,
inet - > inet_num , tp - > snd_una , tp - > snd_nxt ,
jiffies_to_msecs ( jiffies - tp - > rcv_tstamp ) ,
rtx_delta ) ;
2005-04-16 15:20:36 -07:00
}
2011-12-10 09:48:31 +00:00
# if IS_ENABLED(CONFIG_IPV6)
2008-04-14 04:09:36 -07:00
else if ( sk - > sk_family = = AF_INET6 ) {
2023-08-11 10:55:30 +08:00
net_dbg_ratelimited ( " Probing zero-window on %pI6:%u/%u, seq=%u:%u, recv %ums ago, lasting %ums \n " ,
& sk - > sk_v6_daddr , ntohs ( inet - > inet_dport ) ,
inet - > inet_num , tp - > snd_una , tp - > snd_nxt ,
jiffies_to_msecs ( jiffies - tp - > rcv_tstamp ) ,
rtx_delta ) ;
2008-04-14 04:09:36 -07:00
}
2005-04-16 15:20:36 -07:00
# endif
2023-08-11 10:55:29 +08:00
if ( tcp_rtx_probe0_timed_out ( sk , skb ) ) {
2005-04-16 15:20:36 -07:00
tcp_write_err ( sk ) ;
goto out ;
}
tcp: reduce spurious retransmits due to transient SACK reneging
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.
When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutively
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.
Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.
To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.
A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.
We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:
(1) 90% of inter-ACK delays were less than 10ms
(2) 99% of inter-ACK delays were less than RTT/2
In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-04 19:12:29 -04:00
tcp_enter_loss ( sk ) ;
2019-12-03 08:05:52 -08:00
tcp_retransmit_skb ( sk , skb , 1 ) ;
2005-04-16 15:20:36 -07:00
__sk_dst_reset ( sk ) ;
goto out_reset_timer ;
}
2018-11-28 16:06:45 -08:00
__NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPTIMEOUTS ) ;
2005-04-16 15:20:36 -07:00
if ( tcp_write_timeout ( sk ) )
goto out ;
2005-08-09 20:10:42 -07:00
if ( icsk - > icsk_retransmits = = 0 ) {
2018-11-28 16:06:45 -08:00
int mib_idx = 0 ;
2008-07-03 01:05:41 -07:00
2010-10-14 01:52:09 +00:00
if ( icsk - > icsk_ca_state = = TCP_CA_Recovery ) {
2009-02-28 04:44:34 +00:00
if ( tcp_is_sack ( tp ) )
mib_idx = LINUX_MIB_TCPSACKRECOVERYFAIL ;
else
mib_idx = LINUX_MIB_TCPRENORECOVERYFAIL ;
2005-08-10 04:03:31 -03:00
} else if ( icsk - > icsk_ca_state = = TCP_CA_Loss ) {
2008-07-03 01:05:41 -07:00
mib_idx = LINUX_MIB_TCPLOSSFAILURES ;
2010-10-14 01:52:09 +00:00
} else if ( ( icsk - > icsk_ca_state = = TCP_CA_Disorder ) | |
tp - > sacked_out ) {
if ( tcp_is_sack ( tp ) )
mib_idx = LINUX_MIB_TCPSACKFAILURES ;
else
mib_idx = LINUX_MIB_TCPRENOFAILURES ;
2005-04-16 15:20:36 -07:00
}
2018-11-28 16:06:45 -08:00
if ( mib_idx )
__NET_INC_STATS ( sock_net ( sk ) , mib_idx ) ;
2005-04-16 15:20:36 -07:00
}
2014-08-04 19:12:29 -04:00
tcp_enter_loss ( sk ) ;
2005-04-16 15:20:36 -07:00
2019-01-16 15:05:34 -08:00
icsk - > icsk_retransmits + + ;
2017-10-05 22:21:27 -07:00
if ( tcp_retransmit_skb ( sk , tcp_rtx_queue_head ( sk ) , 1 ) > 0 ) {
2005-04-16 15:20:36 -07:00
/* Retransmission failed because of local congestion,
2019-01-16 15:05:34 -08:00
* Let senders fight for local resources conservatively .
2005-04-16 15:20:36 -07:00
*/
2005-08-09 20:10:42 -07:00
inet_csk_reset_xmit_timer ( sk , ICSK_TIME_RETRANS ,
2019-01-16 15:05:34 -08:00
TCP_RESOURCE_PROBE_INTERVAL ,
2005-08-09 20:11:08 -07:00
TCP_RTO_MAX ) ;
2005-04-16 15:20:36 -07:00
goto out ;
}
/* Increase the timeout each time we retransmit. Note that
* we do not increase the rtt estimate . rto is initialized
* from rtt , but increases here . Jacobson ( SIGCOMM 88 ) suggests
* that doubling rto each time is the least we can get away with .
* In KA9Q , Karn uses this for the first few times , and then
* goes to quadratic . netBSD doubles , but only goes up to * 64 ,
* and clamps at 1 to 64 sec afterwards . Note that 120 sec is
* defined in the protocol as the maximum possible RTT . I guess
* we ' ll have to use something other than TCP to talk to the
* University of Mars .
*
* PAWS allows us longer timeouts and large windows , so once
* implemented ftp to mars will work nicely . We will have to fix
* the 120 second clamps though !
*/
2005-08-09 20:10:42 -07:00
icsk - > icsk_backoff + + ;
2005-04-16 15:20:36 -07:00
out_reset_timer :
2010-02-18 02:47:01 +00:00
/* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
* used to reset timer , set to 0. Recalculate ' icsk_rto ' as this
* might be increased if the stream oscillates between thin and thick ,
* thus the old value might already be too high compared to the value
* set by ' tcp_set_rto ' in tcp_input . c which resets the rto without
* backoff . Limit to TCP_THIN_LINEAR_RETRIES before initiating
* exponential backoff behaviour to avoid continuing to hammer
* linear - timeout retransmissions into a black hole .
*/
if ( sk - > sk_state = = TCP_ESTABLISHED & &
2022-07-18 10:26:47 -07:00
( tp - > thin_lto | | READ_ONCE ( net - > ipv4 . sysctl_tcp_thin_linear_timeouts ) ) & &
2010-02-18 02:47:01 +00:00
tcp_stream_is_thin ( tp ) & &
icsk - > icsk_retransmits < = TCP_THIN_LINEAR_RETRIES ) {
icsk - > icsk_backoff = 0 ;
2023-08-11 10:37:47 +08:00
icsk - > icsk_rto = clamp ( __tcp_set_rto ( tp ) ,
tcp_rto_min ( sk ) ,
TCP_RTO_MAX ) ;
tcp: make the first N SYN RTO backoffs linear
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.
We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts
This changes the SYN RTO scheme to be: init_rto_val for the first
tcp_syn_linear_timeouts timeouts, then exponential backoff starting at
init_rto_val. For example, if init_rto_val = 1 and
tcp_syn_linear_timeouts = 2, our backoff scheme would be:
1, 1, 1, 2, 4, 8, 16, ...
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-09 18:05:58 +00:00
	} else if (sk->sk_state != TCP_SYN_SENT ||
		   icsk->icsk_backoff >
		   READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) {
		/* Use normal (exponential) backoff unless linear timeouts are
		 * activated.
		 */
2010-02-18 02:47:01 +00:00
		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
	}
2018-07-19 11:14:44 +10:00
	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
				  tcp_clamp_rto_to_user_timeout(sk), TCP_RTO_MAX);
2022-07-15 10:17:50 -07:00
	if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1) + 1, 0))
2005-04-16 15:20:36 -07:00
		__sk_dst_reset(sk);
out:;
}
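The SYN backoff branch above can be illustrated with a small user-space sketch. It is not kernel code: it assumes time is counted in whole seconds rather than jiffies, and `syn_rto_schedule` and `SKETCH_RTO_MAX` are hypothetical names, with the 120-second ceiling standing in for TCP_RTO_MAX.

```c
#include <stddef.h>

/* Assumed ceiling in seconds; a stand-in for TCP_RTO_MAX. */
#define SKETCH_RTO_MAX 120u

/*
 * Fill out[0..n-1] with the successive SYN timeouts.
 * init_rto:        initial timeout (hypothetical unit: seconds)
 * linear_timeouts: value of the tcp_syn_linear_timeouts sysctl
 */
static void syn_rto_schedule(unsigned int init_rto,
                             unsigned int linear_timeouts,
                             unsigned int *out, size_t n)
{
    unsigned int rto = init_rto;
    unsigned int backoff = 0;

    for (size_t i = 0; i < n; i++) {
        out[i] = rto;   /* the timer fires after the current rto */
        backoff++;      /* mirrors icsk->icsk_backoff++ */

        /* Double only once the linear phase is used up. */
        if (backoff > linear_timeouts) {
            unsigned int doubled = rto << 1;
            rto = doubled < SKETCH_RTO_MAX ? doubled : SKETCH_RTO_MAX;
        }
    }
}
```

Calling `syn_rto_schedule(1, 2, buf, 7)` fills `buf` with 1, 1, 1, 2, 4, 8, 16, which matches the example given in the tcp_syn_linear_timeouts commit message.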
2016-07-16 04:04:34 +02:00
/* Called with bottom-half processing disabled.
   Called by tcp_write_timer() */
2012-07-20 05:45:50 +00:00
void tcp_write_timer_handler(struct sock *sk)
2005-04-16 15:20:36 -07:00
{
2005-08-09 20:10:42 -07:00
	struct inet_connection_sock *icsk = inet_csk(sk);
2005-04-16 15:20:36 -07:00
	int event;
2017-03-03 14:08:21 -08:00
	if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
	    !icsk->icsk_pending)
2022-06-08 23:34:11 -07:00
		return;
2005-04-16 15:20:36 -07:00
2005-08-09 20:10:42 -07:00
	if (time_after(icsk->icsk_timeout, jiffies)) {
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
2022-06-08 23:34:11 -07:00
		return;
2005-04-16 15:20:36 -07:00
	}
2017-05-16 14:00:14 -07:00
	tcp_mstamp_refresh(tcp_sk(sk));
2005-08-09 20:10:42 -07:00
	event = icsk->icsk_pending;
2005-04-16 15:20:36 -07:00
	switch (event) {
2017-01-12 22:11:33 -08:00
	case ICSK_TIME_REO_TIMEOUT:
		tcp_rack_reo_timeout(sk);
		break;
tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-11 10:00:43 +00:00
	case ICSK_TIME_LOSS_PROBE:
		tcp_send_loss_probe(sk);
		break;
2005-08-09 20:10:42 -07:00
	case ICSK_TIME_RETRANS:
2013-03-11 10:00:43 +00:00
		icsk->icsk_pending = 0;
2005-04-16 15:20:36 -07:00
		tcp_retransmit_timer(sk);
		break;
2005-08-09 20:10:42 -07:00
	case ICSK_TIME_PROBE0:
2013-03-11 10:00:43 +00:00
		icsk->icsk_pending = 0;
2005-04-16 15:20:36 -07:00
		tcp_probe_timer(sk);
		break;
	}
2012-07-20 05:45:50 +00:00
}
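The ICSK_TIME_LOSS_PROBE case handled above fires when a probe timeout (PTO) expires. As a rough sketch of the PTO rule stated in the Tail loss probe commit message, the function below works in milliseconds and uses a single smoothed RTT for both the "SRTT" and "RTT" terms, which is an assumption. `tlp_pto_ms` is a hypothetical name, not a kernel symbol, and the case packets_out == 0 (not covered by the commit message) is simply treated like the general case.

```c
/* All values are in milliseconds. */
static unsigned int tlp_pto_ms(unsigned int srtt_ms, unsigned int rto_ms,
                               unsigned int packets_out)
{
    unsigned int pto = 2 * srtt_ms;

    if (packets_out == 1) {
        /* Leave room for the peer's delayed ACK:
         * max(2*RTT, 1.5*RTT + 200ms).
         */
        unsigned int with_delack = srtt_ms + srtt_ms / 2 + 200;

        if (with_delack > pto)
            pto = with_delack;
    } else if (pto < 10) {
        /* max(2*SRTT, 10ms) */
        pto = 10;
    }

    /* PTO = min(PTO, RTO): the probe never fires later than the RTO. */
    return pto < rto_ms ? pto : rto_ms;
}
```

For example, with a 100 ms SRTT and a 1000 ms RTO, several packets in flight give a PTO of 200 ms, while a single packet in flight gives 350 ms.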
2017-10-16 17:29:19 -07:00
static void tcp_write_timer(struct timer_list *t)
2012-07-20 05:45:50 +00:00
{
2017-10-16 17:29:19 -07:00
	struct inet_connection_sock *icsk =
			from_timer(icsk, t, icsk_retransmit_timer);
	struct sock *sk = &icsk->icsk_inet.sk;
2012-07-20 05:45:50 +00:00
	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		tcp_write_timer_handler(sk);
	} else {
2016-07-16 04:04:34 +02:00
		/* delegate our work to tcp_release_cb() */
2016-12-03 11:14:57 -08:00
		if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &sk->sk_tsq_flags))
tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added a bug leading to the following trace:
[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]
The bug comes from the fact that the timer set in sk_reset_timer() can run
before we actually do the sock_hold(). The socket refcount reaches zero and
we free the socket too soon.
The timer handler is not allowed to reduce the socket refcnt if the socket is
owned by the user, or we need to change the sk_reset_timer() implementation.
We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bits are set in tsq_flags.
Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.
For consistency, use the same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 00:22:46 +00:00
			sock_hold(sk);
2012-07-20 05:45:50 +00:00
	}
2005-04-16 15:20:36 -07:00
	bh_unlock_sock(sk);
	sock_put(sk);
}
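The deferral pattern in tcp_write_timer() can be modelled in user space with C11 atomics. This is only a sketch: `struct sketch_sock`, `sketch_timer_fire` and `sketch_release` are hypothetical names, and the release side reflects an assumption about what tcp_release_cb() does (run the deferred handler, then drop the extra reference).

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_WRITE_TIMER_DEFERRED 1u

struct sketch_sock {
    atomic_uint flags;   /* deferred-work bits, like sk_tsq_flags */
    atomic_int refcnt;   /* socket reference count */
    bool owned_by_user;  /* stands in for sock_owned_by_user() */
    int handled;         /* how many times the handler body ran */
};

/* Timer expiry: run the handler now, or defer it to the socket owner. */
static void sketch_timer_fire(struct sketch_sock *sk)
{
    if (!sk->owned_by_user) {
        sk->handled++;
    } else {
        unsigned int old = atomic_fetch_or(&sk->flags,
                                           SKETCH_WRITE_TIMER_DEFERRED);

        /* Only the caller that sets the bit takes the extra reference,
         * so the deferred work keeps the socket alive exactly once.
         */
        if (!(old & SKETCH_WRITE_TIMER_DEFERRED))
            atomic_fetch_add(&sk->refcnt, 1);   /* like sock_hold() */
    }
    atomic_fetch_sub(&sk->refcnt, 1);           /* like sock_put() */
}

/* Owner releases the socket: run deferred work and drop its reference. */
static void sketch_release(struct sketch_sock *sk)
{
    unsigned int old = atomic_fetch_and(&sk->flags,
                                        ~SKETCH_WRITE_TIMER_DEFERRED);

    sk->owned_by_user = false;
    if (old & SKETCH_WRITE_TIMER_DEFERRED) {
        sk->handled++;
        atomic_fetch_sub(&sk->refcnt, 1);
    }
}
```

The point of the extra reference is the one made in the "fix possible socket refcount problem" commit message: the timer's own reference is dropped at the end of the callback, so deferred work needs a reference of its own until the owner runs it.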
2015-03-22 10:22:19 -07:00
void tcp_syn_ack_timeout(const struct request_sock *req)
2010-01-17 19:09:39 -08:00
{
2015-03-22 10:22:19 -07:00
	struct net *net = read_pnet(&inet_rsk(req)->ireq_net);
2016-04-27 16:44:39 -07:00
	__NET_INC_STATS(net, LINUX_MIB_TCPTIMEOUTS);
2010-01-17 19:09:39 -08:00
}
EXPORT_SYMBOL(tcp_syn_ack_timeout);
2005-04-16 15:20:36 -07:00
void tcp_set_keepalive(struct sock *sk, int val)
{
	if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
		return;
	if (val && !sock_flag(sk, SOCK_KEEPOPEN))
2005-08-09 20:10:42 -07:00
		inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tcp_sk(sk)));
2005-04-16 15:20:36 -07:00
	else if (!val)
2005-08-09 20:10:42 -07:00
		inet_csk_delete_keepalive_timer(sk);
2005-04-16 15:20:36 -07:00
}
2017-01-09 16:55:12 +01:00
EXPORT_SYMBOL_GPL(tcp_set_keepalive);
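From user space, the keepalive timer armed by tcp_set_keepalive() is normally reached through socket options. The sketch below assumes a Linux system whose `<netinet/tcp.h>` defines TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; `enable_keepalive` is a hypothetical helper name.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Turn on keepalive for a TCP socket and set its three tunables.
 * idle_s:  idle seconds before the first probe
 * intvl_s: seconds between probes
 * probes:  unanswered probes tolerated before the connection is dropped
 * Returns 0 on success, -1 on failure (errno set by setsockopt).
 */
static int enable_keepalive(int fd, int idle_s, int intvl_s, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}
```

A call such as `enable_keepalive(fd, 5, 2, 3)` asks for a first probe after 5 idle seconds, then a probe every 2 seconds, giving up after 3 unanswered probes.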
2005-04-16 15:20:36 -07:00
2017-10-16 17:29:19 -07:00
static void tcp_keepalive_timer(struct timer_list *t)
2005-04-16 15:20:36 -07:00
{
2017-10-16 17:29:19 -07:00
	struct sock *sk = from_timer(sk, t, sk_timer);
2005-08-10 04:03:31 -03:00
	struct inet_connection_sock *icsk = inet_csk(sk);
2005-04-16 15:20:36 -07:00
	struct tcp_sock *tp = tcp_sk(sk);
2010-04-26 18:33:27 +00:00
	u32 elapsed;
2005-04-16 15:20:36 -07:00
	/* Only process if socket is not in use. */
	bh_lock_sock(sk);
	if (sock_owned_by_user(sk)) {
2007-02-09 23:24:47 +09:00
		/* Try again later. */
2005-08-09 20:10:42 -07:00
		inet_csk_reset_keepalive_timer(sk, HZ / 20);
2005-04-16 15:20:36 -07:00
		goto out;
	}
	if (sk->sk_state == TCP_LISTEN) {
inet: get rid of central tcp/dccp listener timer
One of the major issues for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.
This function runs for awfully long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundreds of ms.
SYNACK are sent in huge bursts, likely to cause severe drops anyway.
This model was OK 15 years ago when memory was very tight.
We now can afford to have a timer per request sock.
Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.
With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.
This is ~100 times more than what we could achieve before this patch.
Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 19:04:20 -07:00
		pr_err("Hmm... keepalive on a LISTEN ???\n");
2005-04-16 15:20:36 -07:00
		goto out;
	}
2017-12-12 18:22:52 -08:00
	tcp_mstamp_refresh(tp);
2005-04-16 15:20:36 -07:00
	if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
2023-08-04 14:46:15 +00:00
		if (READ_ONCE(tp->linger2) >= 0) {
2005-08-09 20:10:42 -07:00
			const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
2005-04-16 15:20:36 -07:00
			if (tmo > 0) {
				tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
				goto out;
			}
		}
		tcp_send_active_reset(sk, GFP_ATOMIC);
		goto death;
	}
net: fix keepalive code vs TCP_FASTOPEN_CONNECT
syzkaller was able to trigger a divide by 0 in TCP stack [1]
Issue here is that keepalive timer needs to be updated to not attempt
to send a probe if the connection setup was deferred using
TCP_FASTOPEN_CONNECT socket option added in linux-4.11
[1]
divide error: 0000 [#1] SMP
CPU: 18 PID: 0 Comm: swapper/18 Not tainted
task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
RIP: 0010:[<ffffffff8409cc0d>] [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
Call Trace:
<IRQ>
[<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
[<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
[<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
[<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
[<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
[<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
[<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
[<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
[<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
[<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
<EOI>
[<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
[<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0
Tested:
Following packetdrill no longer crashes the kernel
`echo 0 >/proc/sys/net/ipv4/tcp_timestamps`
// Cache warmup: send a Fast Open cookie request
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
+0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
+.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
+0 > . 1:1(0) ack 1
+0 close(3) = 0
+0 > F. 1:1(0) ack 1
+0 < F. 1:1(0) ack 2 win 92
+0 > . 2:2(0) ack 2
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
+0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
+0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
+.01 connect(4, ..., ...) = 0
+0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
+10 close(4) = 0
`echo 1 >/proc/sys/net/ipv4/tcp_timestamps`
Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 23:10:46 -07:00
	if (!sock_flag(sk, SOCK_KEEPOPEN) ||
	    ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
2005-04-16 15:20:36 -07:00
		goto out;
	elapsed = keepalive_time_when(tp);
	/* It is alive without keepalive 8) */
2017-10-05 22:21:27 -07:00
	if (tp->packets_out || !tcp_write_queue_empty(sk))
2005-04-16 15:20:36 -07:00
		goto resched;
2010-04-26 18:33:27 +00:00
	elapsed = keepalive_time_elapsed(tp);
2005-04-16 15:20:36 -07:00
	if (elapsed >= keepalive_time_when(tp)) {
2023-08-04 14:46:12 +00:00
		u32 user_timeout = READ_ONCE(icsk->icsk_user_timeout);
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
up to 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in any way when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-27 19:13:28 +00:00
		/* If the TCP_USER_TIMEOUT option is enabled, use that
		 * to determine when to timeout instead.
		 */
2023-08-04 14:46:12 +00:00
		if ((user_timeout != 0 &&
		    elapsed >= msecs_to_jiffies(user_timeout) &&
2010-08-27 19:13:28 +00:00
		    icsk->icsk_probes_out > 0) ||
2023-08-04 14:46:12 +00:00
		    (user_timeout == 0 &&
2010-08-27 19:13:28 +00:00
		    icsk->icsk_probes_out >= keepalive_probes(tp))) {
2005-04-16 15:20:36 -07:00
			tcp_send_active_reset(sk, GFP_ATOMIC);
			tcp_write_err(sk);
			goto out;
		}
2015-05-06 14:26:25 -07:00
		if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
2005-08-10 04:03:31 -03:00
			icsk->icsk_probes_out++;
2005-04-16 15:20:36 -07:00
			elapsed = keepalive_intvl_when(tp);
		} else {
			/* If keepalive was lost due to local congestion,
			 * try harder.
			 */
			elapsed = TCP_RESOURCE_PROBE_INTERVAL;
		}
	} else {
		/* It is tp->rcv_tstamp + keepalive_time_when(tp) */
		elapsed = keepalive_time_when(tp) - elapsed;
	}
resched:
2005-08-09 20:10:42 -07:00
	inet_csk_reset_keepalive_timer(sk, elapsed);
2005-04-16 15:20:36 -07:00
	goto out;
2007-02-09 23:24:47 +09:00
death:
2005-04-16 15:20:36 -07:00
	tcp_done(sk);
out:
	bh_unlock_sock(sk);
	sock_put(sk);
}
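The abort condition inside tcp_keepalive_timer() is easier to read as a pure predicate. The sketch below restates it in user-space C. `keepalive_should_abort` is a hypothetical name, and it compares milliseconds directly, whereas the kernel converts the user timeout with msecs_to_jiffies() and compares jiffies.

```c
#include <stdbool.h>

/*
 * Decide whether a connection whose keepalive idle time has expired
 * should be reset.
 * user_timeout_ms: TCP_USER_TIMEOUT value, 0 when the option is unset
 * elapsed_ms:      idle time since the peer was last heard from
 * probes_out:      keepalive probes sent so far without an answer
 * max_probes:      configured keepalive probe limit
 */
static bool keepalive_should_abort(unsigned int user_timeout_ms,
                                   unsigned int elapsed_ms,
                                   unsigned int probes_out,
                                   unsigned int max_probes)
{
    /* With a user timeout, the probe count is ignored except that at
     * least one probe must already have been sent.
     */
    if (user_timeout_ms != 0)
        return elapsed_ms >= user_timeout_ms && probes_out > 0;

    /* Without one, fall back to the probe limit. */
    return probes_out >= max_probes;
}
```

This makes the behaviour described in the TCP_USER_TIMEOUT commit message explicit: when the option is set it overrides the probe count as the criterion for closing the connection.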
2012-07-20 05:45:50 +00:00
tcp: add SACK compression
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.
Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.
This patch adds a high resolution timer and tp->compressed_ack counter.
Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms:
delay = min(5% of RTT, 1 ms)
If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.
When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.
Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.
A new SNMP counter is added in the following patch.
Two other patches add sysctls to allow changing the 1,000,000 and 44
values that this commit hard-coded.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
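The delay rule from the SACK compression commit message, delay = min(5% of RTT, 1 ms), can be written out as a short sketch. It assumes nanosecond units, and `sack_compress_delay_ns` and `SKETCH_COMP_SACK_CAP_NS` are hypothetical names. The cap corresponds to the hard-coded 1,000,000 value the message mentions.

```c
#include <stdint.h>

/* 1 ms expressed in nanoseconds. */
#define SKETCH_COMP_SACK_CAP_NS 1000000ULL

/* Delay before a compressed SACK is sent: 5% of the RTT, capped at 1 ms. */
static uint64_t sack_compress_delay_ns(uint64_t rtt_ns)
{
    uint64_t delay = rtt_ns / 20;   /* 5% of RTT */

    return delay < SKETCH_COMP_SACK_CAP_NS ? delay : SKETCH_COMP_SACK_CAP_NS;
}
```

With a 10 ms RTT the delay is 0.5 ms. With a 100 ms RTT the 5% figure would be 5 ms, so the 1 ms cap applies.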
2018-05-17 14:47:26 -07:00
static enum hrtimer_restart tcp_compressed_ack_kick(struct hrtimer *timer)
{
	struct tcp_sock *tp = container_of(timer, struct tcp_sock, compressed_ack_timer);
	struct sock *sk = (struct sock *)tp;
	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
2020-04-30 10:35:41 -07:00
		if (tp->compressed_ack) {
			/* Since we have to send one ack finally,
2021-06-07 23:01:09 +08:00
			 * subtract one from tp->compressed_ack to keep
2020-04-30 10:35:41 -07:00
			 * LINUX_MIB_TCPACKCOMPRESSED accurate.
			 */
			tp->compressed_ack--;
2018-05-17 14:47:26 -07:00
			tcp_send_ack(sk);
2020-04-30 10:35:41 -07:00
}
2018-05-17 14:47:26 -07:00
	} else {
		if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED,
				      &sk->sk_tsq_flags))
			sock_hold(sk);
	}
	bh_unlock_sock(sk);
	sock_put(sk);
	return HRTIMER_NORESTART;
}
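The handler above either sends the pending ACK directly or, when the socket is owned by a user context, defers the work by setting a flag and taking an extra reference only if the flag was not already set. A minimal user-space sketch of that control flow follows. The struct toy_sock type, its fields, the DELACK_DEFERRED bit and toy_compressed_ack_kick() are hypothetical stand-ins, not kernel APIs, and the sketch assumes the timer was armed with one reference held, as the final sock_put() suggests.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for TCP_DELACK_TIMER_DEFERRED. */
enum { DELACK_DEFERRED = 1u << 0 };

/* Hypothetical, heavily simplified model of the socket state involved. */
struct toy_sock {
	bool owned_by_user;          /* models sock_owned_by_user() */
	unsigned int compressed_ack; /* models tp->compressed_ack */
	atomic_uint flags;           /* models sk->sk_tsq_flags */
	int refcnt;                  /* models the socket reference count */
	int acks_sent;               /* counts modelled tcp_send_ack() calls */
};

/* Returns true if an ACK was sent directly from the timer path. */
static bool toy_compressed_ack_kick(struct toy_sock *sk)
{
	bool sent = false;

	if (!sk->owned_by_user) {
		if (sk->compressed_ack) {
			/* One ACK is sent now, so it is not counted as compressed. */
			sk->compressed_ack--;
			sk->acks_sent++;
			sent = true;
		}
	} else {
		/* Defer: take a reference only on the 0 -> 1 flag transition. */
		unsigned int old = atomic_fetch_or(&sk->flags, DELACK_DEFERRED);

		if (!(old & DELACK_DEFERRED))
			sk->refcnt++;
	}
	sk->refcnt--; /* drop the reference assumed to be held by the timer */
	return sent;
}
```

Taking the extra reference only on the flag transition keeps the count balanced: whoever later clears the flag and runs the deferred work is expected to drop exactly one reference.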
2012-07-20 05:45:50 +00:00
void tcp_init_xmit_timers(struct sock *sk)
{
	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
				  &tcp_keepalive_timer);
2018-09-28 10:28:44 -07:00
	hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
2018-05-10 14:59:43 -07:00
		     HRTIMER_MODE_ABS_PINNED_SOFT);
2017-05-16 04:24:36 -07:00
	tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
2018-05-17 14:47:26 -07:00
	hrtimer_init(&tcp_sk(sk)->compressed_ack_timer, CLOCK_MONOTONIC,
		     HRTIMER_MODE_REL_PINNED_SOFT);
	tcp_sk(sk)->compressed_ack_timer.function = tcp_compressed_ack_kick;
2012-07-20 05:45:50 +00:00
}
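The compressed_ack_timer is initialized in relative mode, so it is armed with a delay rather than an absolute expiry. The "tcp: add SACK compression" commit message gives that delay as min(5 % of RTT, 1 ms). The sketch below works through the arithmetic only. The helper comp_sack_delay_ns(), the constant name, and the assumption that the RTT is supplied in plain microseconds are all illustrative and are not the kernel's actual code or units.

```c
#include <stdint.h>

/* 1 ms cap, expressed in nanoseconds (the 1,000,000 value from the commit message). */
#define COMP_SACK_DELAY_CAP_NS 1000000ULL

/*
 * Hypothetical helper: returns min(5% of RTT, 1 ms) in nanoseconds.
 * rtt_us is assumed to be a plain RTT estimate in microseconds.
 */
static uint64_t comp_sack_delay_ns(uint64_t rtt_us)
{
	/* 5% of the RTT is RTT / 20; convert microseconds to nanoseconds first. */
	uint64_t delay_ns = rtt_us * 1000ULL / 20ULL;

	return delay_ns < COMP_SACK_DELAY_CAP_NS ? delay_ns : COMP_SACK_DELAY_CAP_NS;
}
```

Under these assumptions a 10 ms RTT gives a 500,000 ns delay, while any RTT of 20 ms or more hits the 1 ms cap.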