2005-04-17 02:20:36 +04:00
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Implementation of the Transmission Control Protocol (TCP).
 *
2005-05-06 03:16:16 +04:00
 * Authors:	Ross Biro
2005-04-17 02:20:36 +04:00
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Mark Evans, <evansmp@uhura.aston.ac.uk>
 *		Corey Minyard <wf-rch!minyard@relay.EU.net>
 *		Florian La Roche, <flla@stud.uni-sb.de>
 *		Charles Hedrick, <hedrick@klinzhai.rutgers.edu>
 *		Linus Torvalds, <torvalds@cs.helsinki.fi>
 *		Alan Cox, <gw4pts@gw4pts.ampr.org>
 *		Matthew Dillon, <dillon@apollo.west.oic.com>
 *		Arnt Gulbrandsen, <agulbra@nvg.unit.no>
 *		Jorge Cwik, <jorge@laser.satlink.net>
 */
#include <linux/module.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to place the new include so that its order conforms
to its surroundings. It is put in the include block which contains
core kernel includes, in the same order that the rest are ordered:
alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them, as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
#include <linux/gfp.h>
2005-04-17 02:20:36 +04:00
#include <net/tcp.h>
2010-02-18 05:47:01 +03:00
int sysctl_tcp_thin_linear_timeouts __read_mostly;
2005-04-17 02:20:36 +04:00
2016-07-16 05:04:34 +03:00
/**
 *  tcp_write_err() - close socket and save error info
 *  @sk:  The socket the error has appeared on.
 *
 *  Returns: Nothing (void)
 */
2005-04-17 02:20:36 +04:00
static void tcp_write_err(struct sock *sk)
{
	sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
	sk->sk_error_report(sk);
	tcp_done(sk);
2016-04-28 02:44:39 +03:00
	__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONTIMEOUT);
2005-04-17 02:20:36 +04:00
}
2016-07-16 05:04:34 +03:00
/**
 *  tcp_out_of_resources() - Close socket if out of resources
 *  @sk:        pointer to current socket
 *  @do_reset:  send a last packet with reset flag
2005-04-17 02:20:36 +04:00
*
2016-07-16 05:04:34 +03:00
 *  Do not allow orphaned sockets to eat all our resources.
 *  This is a direct violation of the TCP specs, but it is required
 *  to prevent DoS attacks. It is called when a retransmission timeout
 *  or zero probe timeout occurs on an orphaned socket.
 *
 *  The criteria are still not confirmed experimentally and may change.
 *  We kill the socket if:
 *  1. The number of orphaned sockets exceeds an administratively configured
 *     limit.
 *  2. We have strong memory pressure.
2005-04-17 02:20:36 +04:00
*/
2014-09-30 00:20:38 +04:00
static int tcp_out_of_resources(struct sock *sk, bool do_reset)
2005-04-17 02:20:36 +04:00
{
	struct tcp_sock *tp = tcp_sk(sk);
2010-08-25 13:27:49 +04:00
	int shift = 0;
2005-04-17 02:20:36 +04:00
2007-02-09 17:24:47 +03:00
	/* If the peer does not open its window for a long time, or did not
2005-04-17 02:20:36 +04:00
	 * transmit anything for a long time, penalize it. */
2017-05-17 00:00:03 +03:00
	if ((s32)(tcp_jiffies32 - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
2010-08-25 13:27:49 +04:00
		shift++;
2005-04-17 02:20:36 +04:00
	/* If some dubious ICMP arrived, penalize even more. */
	if (sk->sk_err_soft)
2010-08-25 13:27:49 +04:00
		shift++;
2005-04-17 02:20:36 +04:00
2012-01-31 02:16:06 +04:00
	if (tcp_check_oom(sk, shift)) {
2005-04-17 02:20:36 +04:00
		/* Catch exceptional cases, when connection requires reset.
		 *      1. Last segment was sent recently. */
2017-05-17 00:00:03 +03:00
		if ((s32)(tcp_jiffies32 - tp->lsndtime) <= TCP_TIMEWAIT_LEN ||
2005-04-17 02:20:36 +04:00
		    /*  2. Window is closed. */
		    (!tp->snd_wnd && !tp->packets_out))
2014-09-30 00:20:38 +04:00
			do_reset = true;
2005-04-17 02:20:36 +04:00
		if (do_reset)
			tcp_send_active_reset(sk, GFP_ATOMIC);
		tcp_done(sk);
2016-04-28 02:44:39 +03:00
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
2005-04-17 02:20:36 +04:00
		return 1;
	}
	return 0;
}
2016-07-16 05:04:34 +03:00
/**
 *  tcp_orphan_retries() - Returns maximal number of retries on an orphaned socket
 *  @sk:    Pointer to the current socket.
 *  @alive: bool, socket alive state
 */
2015-10-09 03:41:37 +03:00
static int tcp_orphan_retries(struct sock *sk, bool alive)
2005-04-17 02:20:36 +04:00
{
2016-02-03 10:46:55 +03:00
	int retries = sock_net(sk)->ipv4.sysctl_tcp_orphan_retries; /* May be zero. */
2005-04-17 02:20:36 +04:00
	/* We know from an ICMP that something is wrong. */
	if (sk->sk_err_soft && !alive)
		retries = 0;
	/* However, if socket sent something recently, select some safe
	 * number of retries. 8 corresponds to >100 seconds with minimal
	 * RTO of 200msec. */
	if (retries == 0 && alive)
		retries = 8;
	return retries;
}
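The arithmetic behind the "8 corresponds to >100 seconds" comment in tcp_orphan_retries() can be checked with a small userspace sketch. This is not kernel code: `orphan_retries()` and `backoff_total_ms()` are hypothetical helper names, the 200 ms minimal RTO is taken from the comment above, and the sum assumes plain doubling with no RTO cap.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace mirror of the tcp_orphan_retries() decision. */
int orphan_retries(int sysctl_retries, bool err_soft, bool alive)
{
	int retries = sysctl_retries;	/* may be zero */

	if (err_soft && !alive)		/* ICMP error seen and peer looks dead */
		retries = 0;
	if (retries == 0 && alive)	/* peer still responsive: safe default */
		retries = 8;
	return retries;
}

/* Total wait in ms for the first transmission plus `retries`
 * retransmissions when the RTO starts at rto_ms and doubles each time.
 */
unsigned long backoff_total_ms(unsigned int retries, unsigned long rto_ms)
{
	unsigned long total = 0;
	unsigned int i;

	assert(retries < 16);		/* keep the doubling well within range */
	for (i = 0; i <= retries; i++) {
		total += rto_ms;
		rto_ms *= 2;
	}
	return total;
}
```

Under these assumptions, `backoff_total_ms(8, 200)` is 200 * (2^9 - 1) = 102200 ms, which is just over 100 seconds.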
2007-12-21 12:50:43 +03:00
static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
{
2015-02-10 04:53:16 +03:00
	struct net *net = sock_net(sk);
2007-12-21 12:50:43 +03:00
/* Black hole detection */
2015-02-10 04:53:16 +03:00
	if (net->ipv4.sysctl_tcp_mtu_probing) {
2007-12-21 12:50:43 +03:00
		if (!icsk->icsk_mtup.enabled) {
			icsk->icsk_mtup.enabled = 1;
2015-03-06 06:18:24 +03:00
			icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
2007-12-21 12:50:43 +03:00
			tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
		} else {
2015-02-10 04:53:16 +03:00
			struct net *net = sock_net(sk);
2007-12-21 12:50:43 +03:00
			struct tcp_sock *tp = tcp_sk(sk);
2007-12-21 15:29:16 +03:00
			int mss;
2007-12-21 16:58:29 +03:00
			mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
2015-02-10 04:53:16 +03:00
			mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
2007-12-21 12:50:43 +03:00
			mss = max(mss, 68 - tp->tcp_header_len);
			icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
			tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
		}
	}
}
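The search_low reduction in the else branch of tcp_mtu_probing() can be illustrated on plain integers. This is a minimal sketch, not the kernel routine: `next_probe_mss()` is a hypothetical name, and the caller passes the current MSS, the tcp_base_mss sysctl value, and the TCP header length directly instead of deriving them from a socket.

```c
#include <assert.h>

/* Halve the MSS, cap it at base_mss, and keep it at or above
 * 68 - tcp_header_len, mirroring the three steps in tcp_mtu_probing().
 */
int next_probe_mss(int cur_mss, int base_mss, int tcp_header_len)
{
	int mss;

	assert(cur_mss > 0 && base_mss > 0 && tcp_header_len >= 0);

	mss = cur_mss >> 1;			/* halve on each black-hole hit */
	if (mss > base_mss)			/* min(base_mss, mss) */
		mss = base_mss;
	if (mss < 68 - tcp_header_len)		/* max(mss, 68 - header) */
		mss = 68 - tcp_header_len;
	return mss;
}
```

For example, with an assumed base MSS of 1024 and a 20-byte TCP header, an MSS of 1460 drops to 730, while an MSS of 60 is clamped up to 48.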
2016-07-16 05:04:34 +03:00
/**
 *  retransmits_timed_out() - returns true if this connection has timed out
 *  @sk:       The current socket
 *  @boundary: max number of retransmissions
 *  @timeout:  A custom timeout value.
 *             If set to 0 the default timeout is calculated and used.
 *             Using TCP_RTO_MIN and the number of unsuccessful retransmits.
 *  @syn_set:  true if the SYN Bit was set.
 *
 * The default "timeout" value this function can calculate and use
 * is equivalent to the timeout of a TCP Connection
 * after "boundary" unsuccessful, exponentially backed-off
2010-09-29 00:08:32 +04:00
 * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
 * syn_set flag is set.
2016-07-16 05:04:34 +03:00
*
2009-12-07 09:06:16 +03:00
*/
static bool retransmits_timed_out(struct sock *sk,
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeout allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeout
allows applications to "fail fast" if so desired. Otherwise it may take
up to 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be set during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in any way when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-27 23:13:28 +04:00
				  unsigned int boundary,
2010-10-04 22:56:38 +04:00
				  unsigned int timeout,
2010-09-29 00:08:32 +04:00
				  bool syn_set)
2009-12-07 09:06:16 +03:00
{
2010-08-27 23:13:28 +04:00
	unsigned int linear_backoff_thresh, start_ts;
2010-09-29 00:08:32 +04:00
	unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
2009-12-07 09:06:16 +03:00
	if (!inet_csk(sk)->icsk_retransmits)
		return false;
2014-09-06 02:33:33 +04:00
	start_ts = tcp_sk(sk)->retrans_stamp;
	if (unlikely(!start_ts))
		start_ts = tcp_skb_timestamp(tcp_write_queue_head(sk));
2009-12-07 09:06:16 +03:00
2010-08-27 23:13:28 +04:00
	if (likely(timeout == 0)) {
2010-10-04 22:56:38 +04:00
		linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base);
2009-12-07 09:06:16 +03:00
2010-08-27 23:13:28 +04:00
		if (boundary <= linear_backoff_thresh)
2010-10-04 22:56:38 +04:00
			timeout = ((2 << boundary) - 1) * rto_base;
2010-08-27 23:13:28 +04:00
		else
2010-10-04 22:56:38 +04:00
			timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +
2010-08-27 23:13:28 +04:00
				(boundary - linear_backoff_thresh) * TCP_RTO_MAX;
	}
2009-12-07 09:06:16 +03:00
	return (tcp_time_stamp - start_ts) >= timeout;
}
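The default timeout that retransmits_timed_out() derives when `timeout` is 0 can be reproduced in userspace. The following is a sketch under stated assumptions, not the kernel implementation: it works in milliseconds rather than jiffies, the three constants are assumed stand-ins for TCP_RTO_MIN (200 ms), TCP_RTO_MAX (120 s) and TCP_TIMEOUT_INIT (1 s), and `ilog2_u32()` and `default_timeout_ms()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values in milliseconds; the kernel expresses them in jiffies. */
#define RTO_MIN_MS	200u	/* stands in for TCP_RTO_MIN */
#define RTO_MAX_MS	120000u	/* stands in for TCP_RTO_MAX */
#define TIMEOUT_INIT_MS	1000u	/* stands in for TCP_TIMEOUT_INIT */

/* Floor of log2(v) for v > 0, a userspace stand-in for ilog2(). */
unsigned int ilog2_u32(unsigned int v)
{
	unsigned int r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Default timeout for `boundary` backed-off retransmissions:
 * exponential growth until the RTO would reach the maximum,
 * then one maximum RTO per additional retransmission.
 */
unsigned int default_timeout_ms(unsigned int boundary, bool syn_set)
{
	unsigned int rto_base = syn_set ? TIMEOUT_INIT_MS : RTO_MIN_MS;
	unsigned int thresh = ilog2_u32(RTO_MAX_MS / rto_base);

	if (boundary <= thresh)
		return ((2u << boundary) - 1) * rto_base;
	return ((2u << thresh) - 1) * rto_base +
	       (boundary - thresh) * RTO_MAX_MS;
}
```

With these assumed constants, a boundary of 15 for established connections gives 1023 * 200 + 6 * 120000 = 924600 ms, and a boundary of 6 with `syn_set` gives 127 * 1000 = 127000 ms.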
2005-04-17 02:20:36 +04:00
/* A write timeout has occurred. Process the after effects. */
static int tcp_write_timeout(struct sock *sk)
{
2006-03-21 04:53:41 +03:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2013-10-29 21:09:05 +04:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
2016-02-03 10:46:49 +03:00
struct net * net = sock_net ( sk ) ;
2005-04-17 02:20:36 +04:00
int retry_until ;
2011-12-19 17:56:45 +04:00
bool do_reset , syn_set = false ;
2005-04-17 02:20:36 +04:00
	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
2013-10-29 21:09:05 +04:00
if ( icsk - > icsk_retransmits ) {
2010-04-09 03:03:29 +04:00
dst_negative_advice ( sk ) ;
2013-10-29 21:09:05 +04:00
			if (tp->syn_fastopen || tp->syn_data)
2015-04-07 00:37:27 +03:00
				tcp_fastopen_cache_set(sk, 0, NULL, true, 0);
2015-11-19 05:17:31 +03:00
			if (tp->syn_data && icsk->icsk_retransmits == 1)
2016-04-30 00:16:47 +03:00
				NET_INC_STATS(sock_net(sk),
					      LINUX_MIB_TCPFASTOPENACTIVEFAIL);
2016-09-28 05:03:37 +03:00
		} else if (!tp->syn_data && !tp->syn_fastopen) {
			sk_rethink_txhash(sk);
2013-10-29 21:09:05 +04:00
}
2016-02-03 10:46:49 +03:00
		retry_until = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_syn_retries;
2011-12-19 17:56:45 +04:00
syn_set = true ;
2005-04-17 02:20:36 +04:00
} else {
2016-02-03 10:46:53 +03:00
		if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1, 0, 0)) {
2015-11-19 05:17:30 +03:00
			/* Some middle-boxes may black-hole Fast Open _after_
			 * the handshake. Therefore we conservatively disable
2017-04-21 00:45:48 +03:00
			 * Fast Open on this path on recurring timeouts after
			 * successful Fast Open.
2015-11-19 05:17:30 +03:00
*/
2017-04-21 00:45:48 +03:00
			if (tp->syn_data_acked) {
2015-11-19 05:17:30 +03:00
tcp_fastopen_cache_set ( sk , 0 , NULL , true , 0 ) ;
2016-02-03 10:46:53 +03:00
				if (icsk->icsk_retransmits == net->ipv4.sysctl_tcp_retries1)
2016-04-30 00:16:47 +03:00
					NET_INC_STATS(sock_net(sk),
						      LINUX_MIB_TCPFASTOPENACTIVEFAIL);
2015-11-19 05:17:30 +03:00
}
2006-03-21 04:53:41 +03:00
/* Black hole detection */
2007-12-21 12:50:43 +03:00
tcp_mtu_probing ( icsk , sk ) ;
2005-04-17 02:20:36 +04:00
2010-04-09 03:03:29 +04:00
dst_negative_advice ( sk ) ;
2016-09-28 05:03:37 +03:00
} else {
sk_rethink_txhash ( sk ) ;
2005-04-17 02:20:36 +04:00
}
2016-02-03 10:46:54 +03:00
		retry_until = net->ipv4.sysctl_tcp_retries2;
2005-04-17 02:20:36 +04:00
		if (sock_flag(sk, SOCK_DEAD)) {
2015-10-09 03:41:37 +03:00
			const bool alive = icsk->icsk_rto < TCP_RTO_MAX;
2007-02-09 17:24:47 +03:00
2005-04-17 02:20:36 +04:00
			retry_until = tcp_orphan_retries(sk, alive);
Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
which may represent a number of allowed retransmissions or a timeout value.
Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
in number of allowed retransmissions.
For any desired threshold R2 (by means of time) one can specify tcp_retries2
(by means of number of retransmissions) such that TCP will not time out
earlier than R2. This is the case, because the RTO schedule follows a fixed
pattern, namely exponential backoff.
However, the RTO behaviour is not predictable any more if RTO backoffs can be
reverted, as is the case in the draft
"Make TCP more Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
In the worst case TCP would time out a connection after 3.2 seconds, if the
initial RTO equaled MIN_RTO and each backoff has been reverted.
This patch introduces a function retransmits_timed_out(N),
which calculates the timeout of a TCP connection, assuming an initial
RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
Whenever timeout decisions are made by comparing the retransmission counter
to some value N, this function can be used, instead.
The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
can occur than the value indicates. However, it yields a timeout which is
similar to the one of an unpatched, exponentially backing off TCP in the same
scenario. As no application could rely on an RTO greater than MIN_RTO, there
should be no risk of a regression.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-26 04:16:34 +04:00
			do_reset = alive ||
2010-10-04 22:56:38 +04:00
				!retransmits_timed_out(sk, retry_until, 0, 0);
2005-04-17 02:20:36 +04:00
2009-08-26 04:16:34 +04:00
			if (tcp_out_of_resources(sk, do_reset))
2005-04-17 02:20:36 +04:00
return 1 ;
}
}
2010-08-27 23:13:28 +04:00
	if (retransmits_timed_out(sk, retry_until,
2010-10-04 22:56:38 +04:00
				  syn_set ? 0 : icsk->icsk_user_timeout, syn_set)) {
2005-04-17 02:20:36 +04:00
		/* Has it gone just too far? */
		tcp_write_err(sk);
		return 1;
	}
	return 0;
}
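The `icsk_user_timeout` value consulted in tcp_write_timeout() is the one an application sets through the TCP_USER_TIMEOUT socket option described in the commit message quoted earlier. A minimal userspace sketch of setting and reading it follows, assuming a Linux system whose headers define TCP_USER_TIMEOUT; `set_user_timeout()` and `get_user_timeout()` are hypothetical wrapper names.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set the user timeout in milliseconds; 0 restores the system default.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_user_timeout(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}

/* Read back the current user timeout in milliseconds. */
int get_user_timeout(int fd, unsigned int *timeout_ms)
{
	socklen_t len = sizeof(*timeout_ms);

	return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms, &len);
}
```

A caller would typically create a TCP socket, call `set_user_timeout(fd, 30000)` before or after connecting, and treat a later ETIMEDOUT on send or receive as the signal that unacknowledged data outlived the chosen 30 seconds.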
2016-04-30 00:16:47 +03:00
/* Called with BH disabled */
2012-07-20 09:45:50 +04:00
void tcp_delack_timer_handler(struct sock *sk)
2005-04-17 02:20:36 +04:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
2005-08-10 07:10:42 +04:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-17 02:20:36 +04:00
2008-01-11 08:56:38 +03:00
sk_mem_reclaim_partial ( sk ) ;
2005-04-17 02:20:36 +04:00
2017-03-04 01:08:21 +03:00
	if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
	    !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
2005-04-17 02:20:36 +04:00
goto out ;
2005-08-10 07:10:42 +04:00
	if (time_after(icsk->icsk_ack.timeout, jiffies)) {
		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
2005-04-17 02:20:36 +04:00
goto out ;
}
2005-08-10 07:10:42 +04:00
	icsk->icsk_ack.pending &= ~ICSK_ACK_TIMER;
2005-04-17 02:20:36 +04:00
2005-07-09 01:57:23 +04:00
	if (!skb_queue_empty(&tp->ucopy.prequeue)) {
2005-04-17 02:20:36 +04:00
struct sk_buff * skb ;
2016-04-28 02:44:39 +03:00
__NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPSCHEDULERFAILED ) ;
2005-04-17 02:20:36 +04:00
		while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
2008-10-08 01:18:42 +04:00
sk_backlog_rcv ( sk , skb ) ;
2005-04-17 02:20:36 +04:00
tp - > ucopy . memory = 0 ;
}
2005-08-10 07:10:42 +04:00
	if (inet_csk_ack_scheduled(sk)) {
		if (!icsk->icsk_ack.pingpong) {
2005-04-17 02:20:36 +04:00
/* Delayed ACK missed: inflate ATO. */
2005-08-10 07:10:42 +04:00
			icsk->icsk_ack.ato = min(icsk->icsk_ack.ato << 1, icsk->icsk_rto);
2005-04-17 02:20:36 +04:00
		} else {
			/* Delayed ACK missed: leave pingpong mode and
			 * deflate ATO.
			 */
2005-08-10 07:10:42 +04:00
			icsk->icsk_ack.pingpong = 0;
			icsk->icsk_ack.ato = TCP_ATO_MIN;
2005-04-17 02:20:36 +04:00
}
tcp_send_ack ( sk ) ;
2016-04-28 02:44:39 +03:00
__NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_DELAYEDACKS ) ;
2005-04-17 02:20:36 +04:00
}
out:
2015-05-15 22:39:27 +03:00
if ( tcp_under_memory_pressure ( sk ) )
2007-12-31 11:11:19 +03:00
sk_mem_reclaim ( sk ) ;
2012-07-20 09:45:50 +04:00
}
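The ATO adjustment made in tcp_delack_timer_handler() when a delayed ACK is missed can be modelled on a small struct. This is only a sketch: `struct delack_state` and `delack_missed()` are hypothetical names, times are in milliseconds rather than jiffies, and the 40 ms floor is an assumed stand-in for TCP_ATO_MIN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ATO_MIN_MS 40u	/* assumed stand-in for TCP_ATO_MIN */

struct delack_state {
	unsigned int ato_ms;	/* current ACK timeout estimate */
	bool pingpong;		/* interactive (pingpong) mode flag */
};

/* Apply the "delayed ACK missed" rule: outside pingpong mode the ATO
 * doubles but never exceeds the RTO; in pingpong mode the flag is
 * cleared and the ATO drops back to its minimum.
 */
void delack_missed(struct delack_state *st, unsigned int rto_ms)
{
	assert(st != NULL);

	if (!st->pingpong) {
		unsigned int doubled = st->ato_ms << 1;

		st->ato_ms = doubled < rto_ms ? doubled : rto_ms;
	} else {
		st->pingpong = false;
		st->ato_ms = ATO_MIN_MS;
	}
}
```

Starting from 40 ms with a 200 ms RTO, three consecutive misses move the ATO to 80, 160 and then 200 ms, where the RTO cap takes over.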
2016-07-16 05:04:34 +03:00
/**
 *  tcp_delack_timer() - The TCP delayed ACK timeout handler
 *  @data:  Pointer to the current socket. (It is cast to struct sock *)
 *
 *  This function is called (indirectly) when the kernel timer for a TCP packet
 *  of this socket expires. Calls tcp_delack_timer_handler() to do the actual work.
 *
 *  Returns: Nothing (void)
 */
2012-07-20 09:45:50 +04:00
static void tcp_delack_timer(unsigned long data)
{
	struct sock *sk = (struct sock *)data;
	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		tcp_delack_timer_handler(sk);
	} else {
		inet_csk(sk)->icsk_ack.blocked = 1;
2016-04-28 02:44:39 +03:00
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
2012-07-20 09:45:50 +04:00
/* delegate our work to tcp_release_cb() */
2016-12-03 22:14:57 +03:00
if ( ! test_and_set_bit ( TCP_DELACK_TIMER_DEFERRED , & sk - > sk_tsq_flags ) )
tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added bug leading to following trace :
[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]
The bug comes from the fact that the timer set in sk_reset_timer() can run
before we actually do the sock_hold(). The socket refcount reaches zero and
we free the socket too soon.
The timer handler is not allowed to reduce the socket refcnt if the socket is owned
by the user, or we need to change the sk_reset_timer() implementation.
We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit is set in tsq_flags.
Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.
For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 04:22:46 +04:00
sock_hold ( sk ) ;
2012-07-20 09:45:50 +04:00
}
2005-04-17 02:20:36 +04:00
bh_unlock_sock ( sk ) ;
sock_put ( sk ) ;
}
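The deferral pattern used by tcp_delack_timer() above (set a deferred bit, and take an extra reference only if this call was the one that set it) can be sketched with C11 atomics. Everything here is hypothetical scaffolding: `struct sketch_sock`, `sketch_delack_timer()` and `sketch_release_cb()` are illustrative names, and the release path is an assumption about how the owner later consumes the bit, not a copy of tcp_release_cb().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_DELACK_DEFERRED 0x1u /* hypothetical flag bit */

/* Hypothetical, reduced socket: flags, refcount and an ownership marker. */
struct sketch_sock {
	atomic_uint flags;   /* deferred-work bits */
	atomic_int refcnt;   /* reference count */
	bool owned_by_user;  /* true while process context owns the socket */
	int handled;         /* number of times the handler body ran */
};

/* Timer callback. The armed timer holds one reference, dropped on exit. */
static void sketch_delack_timer(struct sketch_sock *sk)
{
	if (!sk->owned_by_user) {
		sk->handled++; /* do the work right away */
	} else {
		unsigned int old = atomic_fetch_or(&sk->flags,
						   SKETCH_DELACK_DEFERRED);

		/* Take a reference only if this call set the bit, so the
		 * deferred work keeps the socket alive exactly once.
		 */
		if (!(old & SKETCH_DELACK_DEFERRED))
			atomic_fetch_add(&sk->refcnt, 1);
	}
	atomic_fetch_sub(&sk->refcnt, 1); /* drop the timer's reference */
}

/* Owner release path (assumed): run deferred work, drop its reference. */
static void sketch_release_cb(struct sketch_sock *sk)
{
	unsigned int old = atomic_fetch_and(&sk->flags,
					    ~SKETCH_DELACK_DEFERRED);

	if (old & SKETCH_DELACK_DEFERRED) {
		sk->handled++;
		atomic_fetch_sub(&sk->refcnt, 1);
	}
}
```

Without the extra reference, the timer's final put could drop the count to zero while the deferred bit is still pending, which is the premature free described in the commit message above.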
static void tcp_probe_timer ( struct sock * sk )
{
2005-08-10 11:03:31 +04:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-17 02:20:36 +04:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
int max_probes ;
2014-09-30 00:20:38 +04:00
u32 start_ts ;
2005-04-17 02:20:36 +04:00
2007-03-07 23:12:44 +03:00
if ( tp - > packets_out | | ! tcp_send_head ( sk ) ) {
2005-08-10 11:03:31 +04:00
icsk - > icsk_probes_out = 0 ;
2005-04-17 02:20:36 +04:00
return ;
}
2014-09-30 00:20:38 +04:00
/* RFC 1122 4.2.2.17 requires the sender to stay open indefinitely as
* long as the receiver continues to respond to probes . We support this by
* default and reset icsk_probes_out with incoming ACKs . But if the
* socket is orphaned or the user specifies TCP_USER_TIMEOUT , we
* kill the socket when the retry count and the time exceed the
* corresponding system limit . We also implement similar policy when
* we use RTO to probe window in tcp_retransmit_timer ( ) .
2005-04-17 02:20:36 +04:00
*/
2014-09-30 00:20:38 +04:00
start_ts = tcp_skb_timestamp ( tcp_send_head ( sk ) ) ;
if ( ! start_ts )
2017-05-17 00:00:00 +03:00
tcp_send_head ( sk ) - > skb_mstamp = tp - > tcp_mstamp ;
2014-09-30 00:20:38 +04:00
else if ( icsk - > icsk_user_timeout & &
( s32 ) ( tcp_time_stamp - start_ts ) > icsk - > icsk_user_timeout )
goto abort ;
2005-04-17 02:20:36 +04:00
2016-02-03 10:46:54 +03:00
max_probes = sock_net ( sk ) - > ipv4 . sysctl_tcp_retries2 ;
2005-04-17 02:20:36 +04:00
if ( sock_flag ( sk , SOCK_DEAD ) ) {
2015-10-09 03:41:37 +03:00
const bool alive = inet_csk_rto_backoff ( icsk , TCP_RTO_MAX ) < TCP_RTO_MAX ;
2007-02-09 17:24:47 +03:00
2005-04-17 02:20:36 +04:00
max_probes = tcp_orphan_retries ( sk , alive ) ;
2014-09-30 00:20:38 +04:00
if ( ! alive & & icsk - > icsk_backoff > = max_probes )
goto abort ;
if ( tcp_out_of_resources ( sk , true ) )
2005-04-17 02:20:36 +04:00
return ;
}
2005-08-10 11:03:31 +04:00
if ( icsk - > icsk_probes_out > max_probes ) {
2014-09-30 00:20:38 +04:00
abort : tcp_write_err ( sk ) ;
2005-04-17 02:20:36 +04:00
} else {
/* Only send another probe if we didn't close things up. */
tcp_send_probe0 ( sk ) ;
}
}
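The user-timeout test in tcp_probe_timer() above relies on a signed difference of two 32-bit timestamps, which stays correct across counter wraparound. A minimal sketch of that idiom follows; `user_timeout_expired()` is a hypothetical helper, and it assumes the timeout is below 2^31 ticks so that the cast to a signed type is well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true when more than user_timeout ticks have elapsed since
 * start_ts. The unsigned subtraction wraps modulo 2^32, so the result is
 * still the true elapsed time when 'now' has wrapped past zero.
 * Assumption: user_timeout < 2^31, so the signed cast is safe.
 */
static bool user_timeout_expired(uint32_t now, uint32_t start_ts,
				 uint32_t user_timeout)
{
	if (!user_timeout)
		return false; /* 0 means no user timeout was configured */

	return (int32_t)(now - start_ts) > (int32_t)user_timeout;
}
```

For example, with `start_ts = 0xFFFFFFF0` and `now = 5` the elapsed time is 21 ticks, even though `now` is numerically smaller than `start_ts`.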
2012-08-31 16:29:12 +04:00
/*
* Timer for Fast Open socket to retransmit SYNACK . Note that the
* sk here is the child socket , not the parent ( listener ) socket .
*/
static void tcp_fastopen_synack_timer ( struct sock * sk )
{
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
int max_retries = icsk - > icsk_syn_retries ? :
2016-02-03 10:46:50 +03:00
sock_net ( sk ) - > ipv4 . sysctl_tcp_synack_retries + 1 ; /* add one more retry for fastopen */
2012-08-31 16:29:12 +04:00
struct request_sock * req ;
req = tcp_sk ( sk ) - > fastopen_rsk ;
2015-03-22 20:22:19 +03:00
req - > rsk_ops - > syn_ack_timeout ( req ) ;
2012-08-31 16:29:12 +04:00
2012-10-28 03:16:46 +04:00
if ( req - > num_timeout > = max_retries ) {
2012-08-31 16:29:12 +04:00
tcp_write_err ( sk ) ;
return ;
}
/* XXX (TFO) - Unlike regular SYN-ACK retransmit, we ignore error
* returned from rtx_syn_ack ( ) to make it more persistent like
* regular retransmit because if the child socket has been accepted
* it ' s not good to give up too easily .
*/
2012-10-28 03:16:46 +04:00
inet_rtx_syn_ack ( sk , req ) ;
req - > num_timeout + + ;
2016-09-22 02:16:15 +03:00
icsk - > icsk_retransmits + + ;
2012-08-31 16:29:12 +04:00
inet_csk_reset_xmit_timer ( sk , ICSK_TIME_RETRANS ,
2012-10-28 03:16:46 +04:00
TCP_TIMEOUT_INIT < < req - > num_timeout , TCP_RTO_MAX ) ;
2012-08-31 16:29:12 +04:00
}
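The retransmit interval armed above grows as TCP_TIMEOUT_INIT shifted left by the number of timeouts, capped at TCP_RTO_MAX. A small sketch of that progression, in milliseconds, is shown below. The constants are assumptions chosen for illustration (an initial timeout of 1 second and a cap of 120 seconds), and `synack_timeout_ms()` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed values for illustration, in milliseconds. */
#define SKETCH_TIMEOUT_INIT_MS 1000u
#define SKETCH_RTO_MAX_MS      120000u

/* Interval used after num_timeout SYN-ACK timeouts have occurred:
 * initial timeout doubled num_timeout times, clamped to the cap.
 */
static uint32_t synack_timeout_ms(unsigned int num_timeout)
{
	uint64_t t;

	/* Any shift this large already exceeds the cap; avoid an
	 * out-of-range shift count.
	 */
	if (num_timeout >= 32)
		return SKETCH_RTO_MAX_MS;

	t = (uint64_t)SKETCH_TIMEOUT_INIT_MS << num_timeout;
	return t > SKETCH_RTO_MAX_MS ? SKETCH_RTO_MAX_MS : (uint32_t)t;
}
```

Because the counter is incremented before the timer is re-armed in the function above, the first retransmission already waits twice the initial timeout under these assumptions.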
2005-04-17 02:20:36 +04:00
2016-07-16 05:04:34 +03:00
/**
* tcp_retransmit_timer ( ) - The TCP retransmit timeout handler
* @ sk : Pointer to the current socket .
*
* This function gets called when the kernel timer for a TCP packet
* of this socket expires .
*
* It handles retransmission , timer adjustment and other necessary measures .
*
* Returns : Nothing ( void )
*/
Revert Backoff [v3]: Revert RTO on ICMP destination unreachable
Here, an ICMP host/network unreachable message, whose payload fits to
TCP's SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path.
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.
This patch reverts one RTO backoff, if an ICMP host/network unreachable
message, whose payload fits to TCP's SND.UNA, arrives.
Based on the new RTO, the retransmission timer is reset to reflect the
remaining time, or - if the revert clocked out the timer - a retransmission
is sent out immediately.
Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
there have been retransmissions and reversible backoffs, already.
Changes from v2:
1) Renaming of skb in tcp_v4_err() moved to another patch.
2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
3) Fixed code comments.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-26 04:16:31 +04:00
void tcp_retransmit_timer ( struct sock * sk )
2005-04-17 02:20:36 +04:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
2016-02-03 10:46:53 +03:00
struct net * net = sock_net ( sk ) ;
2005-08-10 07:10:42 +04:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-17 02:20:36 +04:00
2012-08-31 16:29:12 +04:00
if ( tp - > fastopen_rsk ) {
2012-10-22 15:26:36 +04:00
WARN_ON_ONCE ( sk - > sk_state ! = TCP_SYN_RECV & &
sk - > sk_state ! = TCP_FIN_WAIT1 ) ;
2012-08-31 16:29:12 +04:00
tcp_fastopen_synack_timer ( sk ) ;
/* Before we receive ACK to our SYN-ACK don't retransmit
* anything else ( e . g . , data or FIN segments ) .
*/
return ;
}
2005-04-17 02:20:36 +04:00
if ( ! tp - > packets_out )
goto out ;
2008-07-26 08:43:18 +04:00
WARN_ON ( tcp_write_queue_empty ( sk ) ) ;
2005-04-17 02:20:36 +04:00
2013-03-11 14:00:44 +04:00
tp - > tlp_high_seq = 0 ;
2005-04-17 02:20:36 +04:00
if ( ! tp - > snd_wnd & & ! sock_flag ( sk , SOCK_DEAD ) & &
! ( ( 1 < < sk - > sk_state ) & ( TCPF_SYN_SENT | TCPF_SYN_RECV ) ) ) {
/* Receiver dastardly shrinks window. Our retransmits
* become zero probes , but we should not timeout this
* connection . If the socket is an orphan , time it out ,
* we cannot allow such beasts to hang infinitely .
*/
2008-04-14 15:09:36 +04:00
struct inet_sock * inet = inet_sk ( sk ) ;
if ( sk - > sk_family = = AF_INET ) {
2014-11-11 21:59:17 +03:00
net_dbg_ratelimited ( " Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired) \n " ,
& inet - > inet_daddr ,
ntohs ( inet - > inet_dport ) ,
inet - > inet_num ,
tp - > snd_una , tp - > snd_nxt ) ;
2005-04-17 02:20:36 +04:00
}
2011-12-10 13:48:31 +04:00
# if IS_ENABLED(CONFIG_IPV6)
2008-04-14 15:09:36 +04:00
else if ( sk - > sk_family = = AF_INET6 ) {
2014-11-11 21:59:17 +03:00
net_dbg_ratelimited ( " Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired) \n " ,
& sk - > sk_v6_daddr ,
ntohs ( inet - > inet_dport ) ,
inet - > inet_num ,
tp - > snd_una , tp - > snd_nxt ) ;
2008-04-14 15:09:36 +04:00
}
2005-04-17 02:20:36 +04:00
# endif
2017-05-17 00:00:07 +03:00
if ( tcp_jiffies32 - tp - > rcv_tstamp > TCP_RTO_MAX ) {
2005-04-17 02:20:36 +04:00
tcp_write_err ( sk ) ;
goto out ;
}
tcp: reduce spurious retransmits due to transient SACK reneging
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.
When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutively
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.
Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.
To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.
A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.
We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:
(1) 90% of inter-ACK delays were less than 10ms
(2) 99% of inter-ACK delays were less than RTT/2
In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 03:12:29 +04:00
tcp_enter_loss ( sk ) ;
tcp-tso: do not split TSO packets at retransmit time
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.
Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
- Less memory overhead, because write queues have fewer skbs
- Less cpu overhead at ACK processing.
- Better SACK processing, as a lot of studies mentioned how
awful linux was at this ;)
- Less cpu overhead to send the rtx packets
(IP stack traversal, netfilter traversal, drivers...)
- Better latencies in presence of losses.
- Smaller spikes in fq like packet schedulers, as retransmits
are not constrained by TCP Small Queues.
1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 20:55:23 +03:00
tcp_retransmit_skb ( sk , tcp_write_queue_head ( sk ) , 1 ) ;
2005-04-17 02:20:36 +04:00
__sk_dst_reset ( sk ) ;
goto out_reset_timer ;
}
if ( tcp_write_timeout ( sk ) )
goto out ;
2005-08-10 07:10:42 +04:00
if ( icsk - > icsk_retransmits = = 0 ) {
2008-07-03 12:05:41 +04:00
int mib_idx ;
2010-10-14 05:52:09 +04:00
if ( icsk - > icsk_ca_state = = TCP_CA_Recovery ) {
2009-02-28 07:44:34 +03:00
if ( tcp_is_sack ( tp ) )
mib_idx = LINUX_MIB_TCPSACKRECOVERYFAIL ;
else
mib_idx = LINUX_MIB_TCPRENORECOVERYFAIL ;
2005-08-10 11:03:31 +04:00
} else if ( icsk - > icsk_ca_state = = TCP_CA_Loss ) {
2008-07-03 12:05:41 +04:00
mib_idx = LINUX_MIB_TCPLOSSFAILURES ;
2010-10-14 05:52:09 +04:00
} else if ( ( icsk - > icsk_ca_state = = TCP_CA_Disorder ) | |
tp - > sacked_out ) {
if ( tcp_is_sack ( tp ) )
mib_idx = LINUX_MIB_TCPSACKFAILURES ;
else
mib_idx = LINUX_MIB_TCPRENOFAILURES ;
2005-04-17 02:20:36 +04:00
} else {
2008-07-03 12:05:41 +04:00
mib_idx = LINUX_MIB_TCPTIMEOUTS ;
2005-04-17 02:20:36 +04:00
}
2016-04-28 02:44:39 +03:00
__NET_INC_STATS ( sock_net ( sk ) , mib_idx ) ;
2005-04-17 02:20:36 +04:00
}
2014-08-05 03:12:29 +04:00
tcp_enter_loss ( sk ) ;
2005-04-17 02:20:36 +04:00
2016-04-21 20:55:23 +03:00
if ( tcp_retransmit_skb ( sk , tcp_write_queue_head ( sk ) , 1 ) > 0 ) {
2005-04-17 02:20:36 +04:00
/* Retransmission failed because of local congestion,
* do not backoff .
*/
2005-08-10 07:10:42 +04:00
if ( ! icsk - > icsk_retransmits )
icsk - > icsk_retransmits = 1 ;
inet_csk_reset_xmit_timer ( sk , ICSK_TIME_RETRANS ,
2005-08-10 07:11:08 +04:00
min ( icsk - > icsk_rto , TCP_RESOURCE_PROBE_INTERVAL ) ,
TCP_RTO_MAX ) ;
2005-04-17 02:20:36 +04:00
goto out ;
}
/* Increase the timeout each time we retransmit. Note that
* we do not increase the rtt estimate . rto is initialized
* from rtt , but increases here . Jacobson ( SIGCOMM 88 ) suggests
* that doubling rto each time is the least we can get away with .
* In KA9Q , Karn uses this for the first few times , and then
* goes to quadratic . netBSD doubles , but only goes up to * 64 ,
* and clamps at 1 to 64 sec afterwards . Note that 120 sec is
* defined in the protocol as the maximum possible RTT . I guess
* we ' ll have to use something other than TCP to talk to the
* University of Mars .
*
* PAWS allows us longer timeouts and large windows , so once
* implemented ftp to mars will work nicely . We will have to fix
* the 120 second clamps though !
*/
2005-08-10 07:10:42 +04:00
icsk - > icsk_backoff + + ;
icsk - > icsk_retransmits + + ;
2005-04-17 02:20:36 +04:00
out_reset_timer :
2010-02-18 05:47:01 +03:00
/* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
* used to reset timer , set to 0. Recalculate ' icsk_rto ' as this
* might be increased if the stream oscillates between thin and thick ,
* thus the old value might already be too high compared to the value
* set by ' tcp_set_rto ' in tcp_input . c which resets the rto without
* backoff . Limit to TCP_THIN_LINEAR_RETRIES before initiating
* exponential backoff behaviour to avoid continuing to hammer
* linear - timeout retransmissions into a black hole .
*/
if ( sk - > sk_state = = TCP_ESTABLISHED & &
( tp - > thin_lto | | sysctl_tcp_thin_linear_timeouts ) & &
tcp_stream_is_thin ( tp ) & &
icsk - > icsk_retransmits < = TCP_THIN_LINEAR_RETRIES ) {
icsk - > icsk_backoff = 0 ;
icsk - > icsk_rto = min ( __tcp_set_rto ( tp ) , TCP_RTO_MAX ) ;
} else {
/* Use normal (exponential) backoff */
icsk - > icsk_rto = min ( icsk - > icsk_rto < < 1 , TCP_RTO_MAX ) ;
}
2005-08-10 07:11:08 +04:00
inet_csk_reset_xmit_timer ( sk , ICSK_TIME_RETRANS , icsk - > icsk_rto , TCP_RTO_MAX ) ;
2016-02-03 10:46:53 +03:00
if ( retransmits_timed_out ( sk , net - > ipv4 . sysctl_tcp_retries1 + 1 , 0 , 0 ) )
2005-04-17 02:20:36 +04:00
__sk_dst_reset ( sk ) ;
out : ;
}
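The timer re-arm step at the end of tcp_retransmit_timer() above chooses between a linear timeout for thin streams and the usual exponential backoff. The sketch below models only that choice. `struct rto_state` and `rearm_rto()` are hypothetical, `base_rto` stands in for the value the kernel recomputes from its RTT estimate, and the constants (a 120 second cap and 6 linear retries) are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration; times are in milliseconds. */
#define SKETCH_RTO_MAX_MS          120000u
#define SKETCH_THIN_LINEAR_RETRIES 6u

/* Hypothetical, reduced retransmit state. */
struct rto_state {
	uint32_t rto;         /* current RTO */
	uint32_t base_rto;    /* RTO derived from the RTT estimate, no backoff */
	uint32_t retransmits; /* retransmissions so far */
	uint32_t backoff;     /* exponential backoff counter */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* thin_linear: true when the stream is thin and linear timeouts are on. */
static void rearm_rto(struct rto_state *s, bool thin_linear)
{
	if (thin_linear && s->retransmits <= SKETCH_THIN_LINEAR_RETRIES) {
		/* Linear timeouts: forget the backoff, restart from base. */
		s->backoff = 0;
		s->rto = min_u32(s->base_rto, SKETCH_RTO_MAX_MS);
	} else {
		/* Normal exponential backoff, clamped to the cap. */
		s->rto = min_u32(s->rto << 1, SKETCH_RTO_MAX_MS);
	}
}
```

Once the retry count passes the linear limit, the same call falls through to doubling, which is what keeps a thin stream from hammering a dead path at a fixed rate.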
2016-07-16 05:04:34 +03:00
/* Called with bottom-half processing disabled.
Called by tcp_write_timer ( ) */
2012-07-20 09:45:50 +04:00
void tcp_write_timer_handler ( struct sock * sk )
2005-04-17 02:20:36 +04:00
{
2005-08-10 07:10:42 +04:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-17 02:20:36 +04:00
int event ;
2017-03-04 01:08:21 +03:00
if ( ( ( 1 < < sk - > sk_state ) & ( TCPF_CLOSE | TCPF_LISTEN ) ) | |
! icsk - > icsk_pending )
2005-04-17 02:20:36 +04:00
goto out ;
2005-08-10 07:10:42 +04:00
if ( time_after ( icsk - > icsk_timeout , jiffies ) ) {
sk_reset_timer ( sk , & icsk - > icsk_retransmit_timer , icsk - > icsk_timeout ) ;
2005-04-17 02:20:36 +04:00
goto out ;
}
2017-05-17 00:00:00 +03:00
skb_mstamp_get ( & tcp_sk ( sk ) - > tcp_mstamp ) ;
2005-08-10 07:10:42 +04:00
event = icsk - > icsk_pending ;
2005-04-17 02:20:36 +04:00
switch ( event ) {
2017-01-13 09:11:33 +03:00
case ICSK_TIME_REO_TIMEOUT :
tcp_rack_reo_timeout ( sk ) ;
break ;
tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
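The PTO scheduling rule stated in the commit message above can be written out as a short function. This is a sketch of the formula as described there, not the kernel implementation: `tlp_pto_ms()` is a hypothetical helper, all times are in milliseconds, and a single smoothed RTT value is assumed to stand for both "SRTT" and "RTT" in the message.

```c
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Probe timeout per the commit message:
 *   packets_out  > 1: max(2*SRTT, 10ms)
 *   packets_out == 1: max(2*RTT, 1.5*RTT + 200ms)
 *   then PTO = min(PTO, RTO)
 */
static uint32_t tlp_pto_ms(uint32_t srtt_ms, uint32_t packets_out,
			   uint32_t rto_ms)
{
	uint32_t pto;

	if (packets_out == 1)
		/* Leave room for the peer's delayed ACK timer. */
		pto = max_u32(2 * srtt_ms, srtt_ms + srtt_ms / 2 + 200);
	else
		pto = max_u32(2 * srtt_ms, 10);

	return min_u32(pto, rto_ms);
}
```

With a 100 ms RTT and a single packet in flight, the delayed-ACK term dominates and the probe waits 350 ms unless the RTO is shorter.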
2013-03-11 14:00:43 +04:00
case ICSK_TIME_LOSS_PROBE :
tcp_send_loss_probe ( sk ) ;
break ;
2005-08-10 07:10:42 +04:00
case ICSK_TIME_RETRANS :
2013-03-11 14:00:43 +04:00
icsk - > icsk_pending = 0 ;
2005-04-17 02:20:36 +04:00
tcp_retransmit_timer ( sk ) ;
break ;
2005-08-10 07:10:42 +04:00
case ICSK_TIME_PROBE0 :
tcp: Tail loss probe (TLP)
This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
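The PTO scheduling rules from the "TLP Algorithm" section of this message can be written as a small pure function. This is a sketch under stated assumptions, not the kernel implementation: all times are plain milliseconds, the function and parameter names are hypothetical, the smoothed RTT is used for every RTT term, and 1.5*RTT is computed with integer arithmetic.

```c
#include <stdint.h>

static inline uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

static inline uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Probe timeout (PTO) in milliseconds, following the commit message:
 *   packets_out > 1:  PTO = max(2*SRTT, 10ms)
 *   packets_out == 1: PTO = max(2*RTT, 1.5*RTT + 200ms)
 *   finally:          PTO = min(PTO, RTO)
 * Hypothetical helper; any packets_out other than 1 takes the first rule.
 */
static inline uint32_t tlp_probe_timeout_ms(uint32_t srtt_ms, uint32_t rto_ms,
					    uint32_t packets_out)
{
	uint32_t pto;

	if (packets_out == 1) {
		/* Leave room for the peer's delayed ACK timer. */
		pto = max_u32(2 * srtt_ms, srtt_ms + srtt_ms / 2 + 200);
	} else {
		pto = max_u32(2 * srtt_ms, 10);
	}

	/* Never schedule the probe later than the RTO itself. */
	return min_u32(pto, rto_ms);
}
```

For example, with SRTT = 100ms and RTO = 1000ms the probe fires after 200ms when several packets are outstanding, and after 350ms when only one is.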
2013-03-11 14:00:43 +04:00
		icsk->icsk_pending = 0;
2005-04-17 02:20:36 +04:00
		tcp_probe_timer(sk);
		break;
	}
out:
2007-12-31 11:11:19 +03:00
	sk_mem_reclaim(sk);
2012-07-20 09:45:50 +04:00
}
static void tcp_write_timer(unsigned long data)
{
	struct sock *sk = (struct sock *)data;
	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		tcp_write_timer_handler(sk);
	} else {
2016-07-16 05:04:34 +03:00
		/* delegate our work to tcp_release_cb() */
2016-12-03 22:14:57 +03:00
		if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &sk->sk_tsq_flags))
tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added a bug leading to the following trace:
[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]
The bug comes from the fact that the timer set in sk_reset_timer() can run
before we actually do the sock_hold(). The socket refcount reaches zero and
we free the socket too soon.
The timer handler must not reduce the socket refcount if the socket is owned
by the user; otherwise we would need to change the sk_reset_timer() implementation.
We should take a reference on the socket when the TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit is set in tsq_flags.
Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.
For consistency, use the same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even though it is not fired from a timer.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
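The reference counting rule described in this message can be illustrated with a single-threaded toy model. It is only a sketch: struct toy_sock and every function below are hypothetical stand-ins, not kernel APIs, and the model assumes that the deferred path drops the extra reference once it has run the postponed work.

```c
#include <stdbool.h>

/* Toy stand-in for a socket; all names here are hypothetical. */
struct toy_sock {
	int refcnt;			/* number of references held */
	bool owned_by_user;		/* models sock_owned_by_user() */
	bool write_timer_deferred;	/* models the DEFERRED flag bit */
	int handled;			/* how often the timer work ran */
};

static inline void toy_hold(struct toy_sock *sk)
{
	sk->refcnt++;
}

static inline void toy_put(struct toy_sock *sk)
{
	sk->refcnt--;
}

/* Timer callback. The armed timer owns one reference, dropped at the end. */
static inline void toy_write_timer(struct toy_sock *sk)
{
	if (!sk->owned_by_user) {
		sk->handled++;			/* do the work right away */
	} else if (!sk->write_timer_deferred) {
		sk->write_timer_deferred = true;
		toy_hold(sk);			/* reference for the deferred work */
	}
	toy_put(sk);				/* the timer's own reference */
}

/* Release path: run the deferred work, then drop its reference. */
static inline void toy_release_cb(struct toy_sock *sk)
{
	if (sk->write_timer_deferred) {
		sk->write_timer_deferred = false;
		sk->handled++;
		toy_put(sk);
	}
}
```

In this model a socket that starts with two references (owner plus timer) and is owned by the user still has two references after toy_write_timer() returns, so the count cannot reach zero before toy_release_cb() runs.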
2012-08-20 04:22:46 +04:00
			sock_hold(sk);
2012-07-20 09:45:50 +04:00
	}
2005-04-17 02:20:36 +04:00
	bh_unlock_sock(sk);
	sock_put(sk);
}
2015-03-22 20:22:19 +03:00
void tcp_syn_ack_timeout(const struct request_sock *req)
2010-01-18 06:09:39 +03:00
{
2015-03-22 20:22:19 +03:00
	struct net *net = read_pnet(&inet_rsk(req)->ireq_net);
2016-04-28 02:44:39 +03:00
	__NET_INC_STATS(net, LINUX_MIB_TCPTIMEOUTS);
2010-01-18 06:09:39 +03:00
}
EXPORT_SYMBOL(tcp_syn_ack_timeout);
2005-04-17 02:20:36 +04:00
void tcp_set_keepalive(struct sock *sk, int val)
{
	if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
		return;
	if (val && !sock_flag(sk, SOCK_KEEPOPEN))
2005-08-10 07:10:42 +04:00
		inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tcp_sk(sk)));
2005-04-17 02:20:36 +04:00
	else if (!val)
2005-08-10 07:10:42 +04:00
		inet_csk_delete_keepalive_timer(sk);
2005-04-17 02:20:36 +04:00
}
2017-01-09 18:55:12 +03:00
EXPORT_SYMBOL_GPL(tcp_set_keepalive);
2005-04-17 02:20:36 +04:00
static void tcp_keepalive_timer(unsigned long data)
{
	struct sock *sk = (struct sock *)data;
2005-08-10 11:03:31 +04:00
	struct inet_connection_sock *icsk = inet_csk(sk);
2005-04-17 02:20:36 +04:00
	struct tcp_sock *tp = tcp_sk(sk);
2010-04-26 22:33:27 +04:00
	u32 elapsed;
2005-04-17 02:20:36 +04:00
	/* Only process if socket is not in use. */
	bh_lock_sock(sk);
	if (sock_owned_by_user(sk)) {
2007-02-09 17:24:47 +03:00
		/* Try again later. */
2005-08-10 07:10:42 +04:00
		inet_csk_reset_keepalive_timer(sk, HZ / 20);
2005-04-17 02:20:36 +04:00
		goto out;
	}
	if (sk->sk_state == TCP_LISTEN) {
inet: get rid of central tcp/dccp listener timer
One of the major issues for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.
This function runs for a very long time with the socket lock held,
meaning that other cpus needing this lock have to spin for hundreds of ms.
SYNACKs are sent in huge bursts, likely to cause severe drops anyway.
This model was OK 15 years ago when memory was very tight.
We can now afford to have a timer per request sock.
Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.
With the following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.
This is ~100 times more than what we could achieve before this patch.
Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 05:04:20 +03:00
		pr_err("Hmm... keepalive on a LISTEN ???\n");
2005-04-17 02:20:36 +04:00
		goto out;
	}
	if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
		if (tp->linger2 >= 0) {
2005-08-10 07:10:42 +04:00
			const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
2005-04-17 02:20:36 +04:00
			if (tmo > 0) {
				tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
				goto out;
			}
		}
		tcp_send_active_reset(sk, GFP_ATOMIC);
		goto death;
	}
	if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
		goto out;
	elapsed = keepalive_time_when(tp);
	/* It is alive without keepalive 8) */
2007-03-07 23:12:44 +03:00
	if (tp->packets_out || tcp_send_head(sk))
2005-04-17 02:20:36 +04:00
		goto resched;
2010-04-26 22:33:27 +04:00
	elapsed = keepalive_time_elapsed(tp);
2005-04-17 02:20:36 +04:00
	if (elapsed >= keepalive_time_when(tp)) {
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
up to 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be set during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in any way when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
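From user space, the option described above is set with setsockopt() at the IPPROTO_TCP level, using an unsigned int in milliseconds. A minimal sketch for Linux follows; the two wrapper function names are hypothetical, while setsockopt(), getsockopt() and the TCP_USER_TIMEOUT constant from <netinet/tcp.h> are the standard interfaces.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set the user timeout (in ms) on a TCP socket; 0 restores the system
 * default. Returns 0 on success, -1 with errno set on failure.
 * Hypothetical wrapper around setsockopt().
 */
static inline int set_user_timeout_ms(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}

/* Read the user timeout (in ms) back from the socket.
 * Hypothetical wrapper around getsockopt().
 */
static inline int get_user_timeout_ms(int fd, unsigned int *timeout_ms)
{
	socklen_t len = sizeof(*timeout_ms);

	return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  timeout_ms, &len);
}
```

An application that wants to "fail fast" could call set_user_timeout_ms(fd, 30000) right after socket(); the value only takes effect once the connection reaches a synchronized state, as the message notes.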
2010-08-27 23:13:28 +04:00
		/* If the TCP_USER_TIMEOUT option is enabled, use that
		 * to determine when to timeout instead.
		 */
		if ((icsk->icsk_user_timeout != 0 &&
		    elapsed >= icsk->icsk_user_timeout &&
		    icsk->icsk_probes_out > 0) ||
		    (icsk->icsk_user_timeout == 0 &&
		    icsk->icsk_probes_out >= keepalive_probes(tp))) {
2005-04-17 02:20:36 +04:00
			tcp_send_active_reset(sk, GFP_ATOMIC);
			tcp_write_err(sk);
			goto out;
		}
2015-05-07 00:26:25 +03:00
		if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
2005-08-10 11:03:31 +04:00
			icsk->icsk_probes_out++;
2005-04-17 02:20:36 +04:00
			elapsed = keepalive_intvl_when(tp);
		} else {
			/* If keepalive was lost due to local congestion,
			 * try harder.
			 */
			elapsed = TCP_RESOURCE_PROBE_INTERVAL;
		}
	} else {
		/* It is tp->rcv_tstamp + keepalive_time_when(tp) */
		elapsed = keepalive_time_when(tp) - elapsed;
	}
2007-12-31 11:11:19 +03:00
	sk_mem_reclaim(sk);
2005-04-17 02:20:36 +04:00
resched:
2005-08-10 07:10:42 +04:00
	inet_csk_reset_keepalive_timer(sk, elapsed);
2005-04-17 02:20:36 +04:00
	goto out;
2007-02-09 17:24:47 +03:00
death:
2005-04-17 02:20:36 +04:00
	tcp_done(sk);
out:
	bh_unlock_sock(sk);
	sock_put(sk);
}
2012-07-20 09:45:50 +04:00
void tcp_init_xmit_timers(struct sock *sk)
{
	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
				  &tcp_keepalive_timer);
2017-05-16 14:24:36 +03:00
	hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
		     HRTIMER_MODE_ABS_PINNED);
	tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
2012-07-20 09:45:50 +04:00
}