2005-04-16 15:20:36 -07:00
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Implementation of the Transmission Control Protocol (TCP).
 *
2005-05-05 16:16:16 -07:00
 * Authors:	Ross Biro
2005-04-16 15:20:36 -07:00
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Mark Evans, <evansmp@uhura.aston.ac.uk>
 *		Corey Minyard <wf-rch!minyard@relay.EU.net>
 *		Florian La Roche, <flla@stud.uni-sb.de>
 *		Charles Hedrick, <hedrick@klinzhai.rutgers.edu>
 *		Linus Torvalds, <torvalds@cs.helsinki.fi>
 *		Alan Cox, <gw4pts@gw4pts.ampr.org>
 *		Matthew Dillon, <dillon@apollo.west.oic.com>
 *		Arnt Gulbrandsen, <agulbra@nvg.unit.no>
 *		Jorge Cwik, <jorge@laser.satlink.net>
 */
#include <linux/module.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/gfp.h>
2005-04-16 15:20:36 -07:00
#include <net/tcp.h>
2006-09-22 14:15:41 -07:00
int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
int sysctl_tcp_synack_retries __read_mostly = TCP_SYNACK_RETRIES;
int sysctl_tcp_keepalive_time __read_mostly = TCP_KEEPALIVE_TIME;
int sysctl_tcp_keepalive_probes __read_mostly = TCP_KEEPALIVE_PROBES;
int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
int sysctl_tcp_orphan_retries __read_mostly;
2010-02-18 02:47:01 +00:00
int sysctl_tcp_thin_linear_timeouts __read_mostly ;
2005-04-16 15:20:36 -07:00
static void tcp_write_timer ( unsigned long ) ;
static void tcp_delack_timer ( unsigned long ) ;
static void tcp_keepalive_timer ( unsigned long data ) ;
2005-08-09 20:10:42 -07:00
void tcp_init_xmit_timers(struct sock *sk)
{
	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
				  &tcp_keepalive_timer);
}
2005-08-09 20:11:08 -07:00
EXPORT_SYMBOL ( tcp_init_xmit_timers ) ;
2005-04-16 15:20:36 -07:00
static void tcp_write_err(struct sock *sk)
{
	sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
	sk->sk_error_report(sk);
	tcp_done(sk);
2008-07-16 20:31:16 -07:00
NET_INC_STATS_BH ( sock_net ( sk ) , LINUX_MIB_TCPABORTONTIMEOUT ) ;
2005-04-16 15:20:36 -07:00
}
/* Do not allow orphaned sockets to eat all our resources.
 * This is a direct violation of TCP specs, but it is required
 * to prevent DoS attacks. It is called when a retransmission timeout
 * or zero probe timeout occurs on an orphaned socket.
 *
2005-11-10 17:13:47 -08:00
 * The criteria are still not confirmed experimentally and may change.
2005-04-16 15:20:36 -07:00
 * We kill the socket if:
 * 1. The number of orphaned sockets exceeds an administratively configured
 *    limit.
 * 2. We have strong memory pressure.
 */
static int tcp_out_of_resources(struct sock *sk, int do_reset)
{
	struct tcp_sock *tp = tcp_sk(sk);
2008-11-25 21:17:14 -08:00
	int orphans = percpu_counter_read_positive(&tcp_orphan_count);
2005-04-16 15:20:36 -07:00
2007-02-09 23:24:47 +09:00
	/* If the peer does not open its window for a long time, or did not transmit
2005-04-16 15:20:36 -07:00
	 * anything for a long time, penalize it. */
	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
		orphans <<= 1;
	/* If some dubious ICMP arrived, penalize even more. */
	if (sk->sk_err_soft)
		orphans <<= 1;
2007-05-29 13:19:18 -07:00
if ( tcp_too_many_orphans ( sk , orphans ) ) {
2005-04-16 15:20:36 -07:00
		if (net_ratelimit())
			printk(KERN_INFO "Out of socket memory\n");
		/* Catch exceptional cases, when connection requires reset.
		 *      1. Last segment was sent recently. */
		if ((s32)(tcp_time_stamp - tp->lsndtime) <= TCP_TIMEWAIT_LEN ||
		    /*  2. Window is closed. */
		    (!tp->snd_wnd && !tp->packets_out))
			do_reset = 1;
		if (do_reset)
			tcp_send_active_reset(sk, GFP_ATOMIC);
		tcp_done(sk);
2008-07-16 20:31:16 -07:00
NET_INC_STATS_BH ( sock_net ( sk ) , LINUX_MIB_TCPABORTONMEMORY ) ;
2005-04-16 15:20:36 -07:00
return 1 ;
}
return 0 ;
}
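The orphan penalty in tcp_out_of_resources() can be illustrated in isolation. The following is a minimal userspace sketch, not kernel code: the function name and its boolean parameters are hypothetical stand-ins for the two conditions tested above (an idle or non-resettable peer, and a pending soft error). Each condition that holds doubles the orphan count that is then compared against the configured limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns the orphan count after applying the
 * two penalties, each of which doubles the count via a left shift. */
static int sketch_penalized_orphans(int orphans, bool idle_or_no_reset,
				    bool soft_error)
{
	assert(orphans >= 0);
	if (idle_or_no_reset)	/* peer idle for a long time, or !do_reset */
		orphans <<= 1;
	if (soft_error)		/* a dubious ICMP left sk_err_soft set */
		orphans <<= 1;
	return orphans;
}
```

With both penalties applied, a socket is judged as if four times as many orphans existed, so it is killed earlier under pressure.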
/* Calculate maximal number of retries on an orphaned socket. */
static int tcp_orphan_retries(struct sock *sk, int alive)
{
	int retries = sysctl_tcp_orphan_retries; /* May be zero. */
	/* We know from an ICMP that something is wrong. */
	if (sk->sk_err_soft && !alive)
		retries = 0;
	/* However, if socket sent something recently, select some safe
	 * number of retries. 8 corresponds to >100 seconds with minimal
	 * RTO of 200msec. */
	if (retries == 0 && alive)
		retries = 8;
	return retries;
}
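The claim that 8 retries correspond to more than 100 seconds can be checked with a short calculation. The sketch below is a userspace illustration, not kernel code: the function name is hypothetical, times are in milliseconds rather than jiffies, and it assumes plain exponential backoff with each wait clamped at a maximum RTO.

```c
#include <assert.h>

/* Hypothetical helper: sums the waits for the first transmission plus
 * `retries` retransmissions, doubling from rto_min_ms each time and
 * clamping every individual wait at rto_max_ms. */
static unsigned long sketch_total_wait_ms(unsigned int retries,
					  unsigned long rto_min_ms,
					  unsigned long rto_max_ms)
{
	unsigned long rto = rto_min_ms, total = 0;
	unsigned int i;

	assert(rto_min_ms > 0 && rto_min_ms <= rto_max_ms);
	for (i = 0; i <= retries; i++) {
		total += rto;
		rto = (rto * 2 > rto_max_ms) ? rto_max_ms : rto * 2;
	}
	return total;
}
```

With a 200 ms minimum RTO and a 120 s clamp, sketch_total_wait_ms(8, 200, 120000) yields 102200 ms, which is just over 100 seconds.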
2007-12-21 01:50:43 -08:00
static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
{
	/* Black hole detection */
	if (sysctl_tcp_mtu_probing) {
		if (!icsk->icsk_mtup.enabled) {
			icsk->icsk_mtup.enabled = 1;
			tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
		} else {
			struct tcp_sock *tp = tcp_sk(sk);
2007-12-21 04:29:16 -08:00
			int mss;
2007-12-21 05:58:29 -08:00
			mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
2007-12-21 01:50:43 -08:00
			mss = min(sysctl_tcp_base_mss, mss);
			mss = max(mss, 68 - tp->tcp_header_len);
			icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
			tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
		}
	}
}
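The MSS adjustment in the else branch of tcp_mtu_probing() is a halve-then-clamp step. The sketch below reproduces only that arithmetic in userspace: the function name is hypothetical, and the MSS, base MSS and header length are passed in directly instead of being read from the socket. The values used in the note that follows (base MSS of 512, header length of 20) are assumptions chosen for illustration.

```c
#include <assert.h>

/* Hypothetical helper: halve the current probing MSS, cap it at the
 * configured base MSS, and keep it at or above 68 - tcp_header_len. */
static int sketch_next_probe_mss(int current_mss, int base_mss,
				 int tcp_header_len)
{
	int mss = current_mss >> 1;
	int floor = 68 - tcp_header_len;

	assert(current_mss > 0 && base_mss > 0 && tcp_header_len >= 0);
	if (mss > base_mss)
		mss = base_mss;
	if (mss < floor)
		mss = floor;
	return mss;
}
```

Starting from an MSS of 1460, one step gives 512 (730 capped at the base MSS); starting from 80, one step gives 48 (40 raised to the floor).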
2009-12-07 06:06:16 +00:00
/* This function calculates a "timeout" which is equivalent to the timeout of a
tree-wide: Assorted spelling fixes
In particular, several occurrences of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-03 08:01:28 +08:00
 * TCP connection after "boundary" unsuccessful, exponentially backed-off
2009-12-07 06:06:16 +00:00
 * retransmissions with an initial RTO of TCP_RTO_MIN.
 */
static bool retransmits_timed_out(struct sock *sk,
				  unsigned int boundary)
{
	unsigned int timeout, linear_backoff_thresh;
	unsigned int start_ts;
	if (!inet_csk(sk)->icsk_retransmits)
		return false;
	if (unlikely(!tcp_sk(sk)->retrans_stamp))
		start_ts = TCP_SKB_CB(tcp_write_queue_head(sk))->when;
	else
		start_ts = tcp_sk(sk)->retrans_stamp;
	linear_backoff_thresh = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
	if (boundary <= linear_backoff_thresh)
		timeout = ((2 << boundary) - 1) * TCP_RTO_MIN;
	else
		timeout = ((2 << linear_backoff_thresh) - 1) * TCP_RTO_MIN +
			  (boundary - linear_backoff_thresh) * TCP_RTO_MAX;
	return (tcp_time_stamp - start_ts) >= timeout;
}
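The timeout formula in retransmits_timed_out() can be tried out in userspace. The sketch below is an illustration under stated assumptions, not the kernel implementation: it works in milliseconds instead of jiffies, assumes a minimum RTO of 200 ms and a maximum of 120 s (the figures mentioned in comments elsewhere in this file), and uses hypothetical names throughout, including a small stand-in for ilog2().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values in milliseconds; the kernel expresses these in jiffies. */
#define SKETCH_RTO_MIN_MS 200u
#define SKETCH_RTO_MAX_MS 120000u

/* Integer base-2 logarithm for v > 0, a stand-in for the kernel's ilog2(). */
static unsigned int sketch_ilog2(uint32_t v)
{
	unsigned int r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Total time covered by `boundary` exponentially backed-off retransmissions
 * starting from the minimum RTO; growth turns linear once a single RTO
 * would exceed the maximum. */
static uint32_t sketch_timeout_ms(unsigned int boundary)
{
	unsigned int thresh = sketch_ilog2(SKETCH_RTO_MAX_MS / SKETCH_RTO_MIN_MS);

	if (boundary <= thresh)
		return ((2u << boundary) - 1) * SKETCH_RTO_MIN_MS;

	return ((2u << thresh) - 1) * SKETCH_RTO_MIN_MS +
	       (boundary - thresh) * SKETCH_RTO_MAX_MS;
}
```

Under these assumptions the threshold is 9, a boundary of 3 gives 3000 ms, and a boundary of 15 gives 924600 ms (roughly 15.4 minutes).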
2005-04-16 15:20:36 -07:00
/* A write timeout has occurred. Process the after effects. */
static int tcp_write_timeout ( struct sock * sk )
{
2006-03-20 17:53:41 -08:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-16 15:20:36 -07:00
int retry_until ;
Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
which may represent a number of allowed retransmissions or a timeout value.
Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
in number of allowed retransmissions.
For any desired threshold R2 (by means of time) one can specify tcp_retries2
(by means of number of retransmissions) such that TCP will not time out
earlier than R2. This is the case, because the RTO schedule follows a fixed
pattern, namely exponential backoff.
However, the RTO behaviour is not predictable any more if RTO backoffs can be
reverted, as it is the case in the draft
"Make TCP more Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
In the worst case TCP would time out a connection after 3.2 seconds, if the
initial RTO equaled MIN_RTO and each backoff has been reverted.
This patch introduces a function retransmits_timed_out(N),
which calculates the timeout of a TCP connection, assuming an initial
RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
Whenever timeout decisions are made by comparing the retransmission counter
to some value N, this function can be used, instead.
The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
can occur than the value indicates. However, it yields a timeout which is
similar to the one of an unpatched, exponentially backing off TCP in the same
scenario. As no application could rely on an RTO greater than MIN_RTO, there
should be no risk of a regression.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-26 00:16:34 +00:00
bool do_reset ;
2005-04-16 15:20:36 -07:00
	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
2005-08-09 20:10:42 -07:00
if ( icsk - > icsk_retransmits )
2010-04-08 23:03:29 +00:00
dst_negative_advice ( sk ) ;
2005-08-09 20:10:42 -07:00
		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
2005-04-16 15:20:36 -07:00
} else {
2009-08-26 00:16:34 +00:00
if ( retransmits_timed_out ( sk , sysctl_tcp_retries1 ) ) {
2006-03-20 17:53:41 -08:00
/* Black hole detection */
2007-12-21 01:50:43 -08:00
tcp_mtu_probing ( icsk , sk ) ;
2005-04-16 15:20:36 -07:00
2010-04-08 23:03:29 +00:00
dst_negative_advice ( sk ) ;
2005-04-16 15:20:36 -07:00
}
retry_until = sysctl_tcp_retries2 ;
if ( sock_flag ( sk , SOCK_DEAD ) ) {
2005-08-09 20:10:42 -07:00
			const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
2007-02-09 23:24:47 +09:00
2005-04-16 15:20:36 -07:00
retry_until = tcp_orphan_retries ( sk , alive ) ;
2009-08-26 00:16:34 +00:00
			do_reset = alive ||
				   !retransmits_timed_out(sk, retry_until);
2005-04-16 15:20:36 -07:00
2009-08-26 00:16:34 +00:00
if ( tcp_out_of_resources ( sk , do_reset ) )
2005-04-16 15:20:36 -07:00
return 1 ;
}
}
2009-08-26 00:16:34 +00:00
if ( retransmits_timed_out ( sk , retry_until ) ) {
2005-04-16 15:20:36 -07:00
/* Has it gone just too far? */
tcp_write_err ( sk ) ;
return 1 ;
}
return 0 ;
}
static void tcp_delack_timer ( unsigned long data )
{
2008-11-03 02:47:38 -08:00
struct sock * sk = ( struct sock * ) data ;
2005-04-16 15:20:36 -07:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
2005-08-09 20:10:42 -07:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-16 15:20:36 -07:00
bh_lock_sock ( sk ) ;
if ( sock_owned_by_user ( sk ) ) {
/* Try again later. */
2005-08-09 20:10:42 -07:00
icsk - > icsk_ack . blocked = 1 ;
2008-07-16 20:31:16 -07:00
NET_INC_STATS_BH ( sock_net ( sk ) , LINUX_MIB_DELAYEDACKLOCKED ) ;
2005-08-09 20:10:42 -07:00
sk_reset_timer ( sk , & icsk - > icsk_delack_timer , jiffies + TCP_DELACK_MIN ) ;
2005-04-16 15:20:36 -07:00
goto out_unlock ;
}
2008-01-10 21:56:38 -08:00
sk_mem_reclaim_partial ( sk ) ;
2005-04-16 15:20:36 -07:00
2005-08-09 20:10:42 -07:00
	if (sk->sk_state == TCP_CLOSE || !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
2005-04-16 15:20:36 -07:00
goto out ;
2005-08-09 20:10:42 -07:00
	if (time_after(icsk->icsk_ack.timeout, jiffies)) {
		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
2005-04-16 15:20:36 -07:00
goto out ;
}
2005-08-09 20:10:42 -07:00
	icsk->icsk_ack.pending &= ~ICSK_ACK_TIMER;
2005-04-16 15:20:36 -07:00
2005-07-08 14:57:23 -07:00
	if (!skb_queue_empty(&tp->ucopy.prequeue)) {
2005-04-16 15:20:36 -07:00
struct sk_buff * skb ;
2008-07-16 20:31:16 -07:00
NET_INC_STATS_BH ( sock_net ( sk ) , LINUX_MIB_TCPSCHEDULERFAILED ) ;
2005-04-16 15:20:36 -07:00
		while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
2008-10-07 14:18:42 -07:00
sk_backlog_rcv ( sk , skb ) ;
2005-04-16 15:20:36 -07:00
tp - > ucopy . memory = 0 ;
}
2005-08-09 20:10:42 -07:00
if ( inet_csk_ack_scheduled ( sk ) ) {
if ( ! icsk - > icsk_ack . pingpong ) {
2005-04-16 15:20:36 -07:00
/* Delayed ACK missed: inflate ATO. */
2005-08-09 20:10:42 -07:00
			icsk->icsk_ack.ato = min(icsk->icsk_ack.ato << 1, icsk->icsk_rto);
2005-04-16 15:20:36 -07:00
} else {
/* Delayed ACK missed: leave pingpong mode and
* deflate ATO .
*/
2005-08-09 20:10:42 -07:00
icsk - > icsk_ack . pingpong = 0 ;
icsk - > icsk_ack . ato = TCP_ATO_MIN ;
2005-04-16 15:20:36 -07:00
}
tcp_send_ack ( sk ) ;
2008-07-16 20:31:16 -07:00
NET_INC_STATS_BH ( sock_net ( sk ) , LINUX_MIB_DELAYEDACKS ) ;
2005-04-16 15:20:36 -07:00
}
TCP_CHECK_TIMER ( sk ) ;
out :
if ( tcp_memory_pressure )
2007-12-31 00:11:19 -08:00
sk_mem_reclaim ( sk ) ;
2005-04-16 15:20:36 -07:00
out_unlock :
bh_unlock_sock ( sk ) ;
sock_put ( sk ) ;
}
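The ACK timeout (ATO) adjustment made when a delayed ACK is missed can be shown on its own. The sketch below is a userspace illustration with hypothetical names: the struct holds only the two fields involved, and the RTO and minimum ATO are passed as plain arguments in arbitrary time units. Outside pingpong mode the ATO doubles but never exceeds the RTO; in pingpong mode the socket leaves that mode and the ATO drops to its minimum.

```c
#include <assert.h>

/* Hypothetical reduced view of the delayed-ACK state. */
struct sketch_ack {
	unsigned int ato;	/* current ACK timeout */
	int pingpong;		/* non-zero while in interactive mode */
};

/* Apply the "delayed ACK missed" adjustment described above. */
static void sketch_delack_missed(struct sketch_ack *ack, unsigned int rto,
				 unsigned int ato_min)
{
	assert(ack != 0);
	if (!ack->pingpong) {
		/* Inflate ATO, bounded by the retransmission timeout. */
		unsigned int doubled = ack->ato << 1;

		ack->ato = doubled < rto ? doubled : rto;
	} else {
		/* Leave pingpong mode and deflate ATO. */
		ack->pingpong = 0;
		ack->ato = ato_min;
	}
}
```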
static void tcp_probe_timer ( struct sock * sk )
{
2005-08-10 04:03:31 -03:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-16 15:20:36 -07:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
int max_probes ;
2007-03-07 12:12:44 -08:00
	if (tp->packets_out || !tcp_send_head(sk)) {
2005-08-10 04:03:31 -03:00
icsk - > icsk_probes_out = 0 ;
2005-04-16 15:20:36 -07:00
return ;
}
	/* *WARNING* RFC 1122 forbids this
	 *
	 * It doesn't AFAIK, because we kill the retransmit timer -AK
	 *
	 * FIXME: We ought not to do it, Solaris 2.5 actually has fixing
	 * this behaviour in Solaris down as a bug fix. [AC]
	 *
2005-08-10 04:03:31 -03:00
	 * Let me explain. icsk_probes_out is zeroed by incoming ACKs
2005-04-16 15:20:36 -07:00
	 * even if they advertise zero window. Hence, connection is killed only
	 * if we received no ACKs for normal connection timeout. It is not killed
	 * only because window stays zero for some time, window may be zero
	 * until armageddon and even later. We are in full accordance
	 * with RFCs, only probe timer combines both retransmission timeout
	 * and probe timeout in one bottle.				--ANK
	 */
max_probes = sysctl_tcp_retries2 ;
if ( sock_flag ( sk , SOCK_DEAD ) ) {
2005-08-09 20:10:42 -07:00
		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
2007-02-09 23:24:47 +09:00
2005-04-16 15:20:36 -07:00
max_probes = tcp_orphan_retries ( sk , alive ) ;
2005-08-10 04:03:31 -03:00
		if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
2005-04-16 15:20:36 -07:00
return ;
}
2005-08-10 04:03:31 -03:00
	if (icsk->icsk_probes_out > max_probes) {
2005-04-16 15:20:36 -07:00
tcp_write_err ( sk ) ;
} else {
/* Only send another probe if we didn't close things up. */
tcp_send_probe0 ( sk ) ;
}
}
/*
* The TCP retransmit timer .
*/
Revert Backoff [v3]: Revert RTO on ICMP destination unreachable
Here, an ICMP host/network unreachable message, whose payload fits to
TCP's SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path.
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.
This patch reverts one RTO backoff, if an ICMP host/network unreachable
message, whose payload fits to TCP's SND.UNA, arrives.
Based on the new RTO, the retransmission timer is reset to reflect the
remaining time, or - if the revert clocked out the timer - a retransmission
is sent out immediately.
Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
there have been retransmissions and reversible backoffs, already.
Changes from v2:
1) Renaming of skb in tcp_v4_err() moved to another patch.
2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
3) Fixed code comments.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-26 00:16:31 +00:00
void tcp_retransmit_timer ( struct sock * sk )
2005-04-16 15:20:36 -07:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
2005-08-09 20:10:42 -07:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-16 15:20:36 -07:00
if ( ! tp - > packets_out )
goto out ;
2008-07-25 21:43:18 -07:00
WARN_ON ( tcp_write_queue_empty ( sk ) ) ;
2005-04-16 15:20:36 -07:00
	if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
	    !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
		/* Receiver dastardly shrinks window. Our retransmits
		 * become zero probes, but we should not timeout this
		 * connection. If the socket is an orphan, time it out,
		 * we cannot allow such beasts to hang infinitely.
		 */
#ifdef TCP_DEBUG
2008-04-14 04:09:36 -07:00
struct inet_sock * inet = inet_sk ( sk ) ;
		if (sk->sk_family == AF_INET) {
2008-12-18 19:54:22 -08:00
			LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
2009-10-15 06:30:45 +00:00
			       &inet->inet_daddr, ntohs(inet->inet_dport),
			       inet->inet_num, tp->snd_una, tp->snd_nxt);
2005-04-16 15:20:36 -07:00
}
2008-04-14 04:09:36 -07:00
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
		else if (sk->sk_family == AF_INET6) {
			struct ipv6_pinfo *np = inet6_sk(sk);
2008-12-18 19:54:22 -08:00
			LIMIT_NETDEBUG(KERN_DEBUG "TCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
2009-10-15 06:30:45 +00:00
			       &np->daddr, ntohs(inet->inet_dport),
			       inet->inet_num, tp->snd_una, tp->snd_nxt);
2008-04-14 04:09:36 -07:00
}
#endif
2005-04-16 15:20:36 -07:00
#endif
		if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
			tcp_write_err(sk);
			goto out;
		}
		tcp_enter_loss(sk, 0);
2007-03-07 12:12:44 -08:00
tcp_retransmit_skb ( sk , tcp_write_queue_head ( sk ) ) ;
2005-04-16 15:20:36 -07:00
__sk_dst_reset ( sk ) ;
goto out_reset_timer ;
}
if ( tcp_write_timeout ( sk ) )
goto out ;
2005-08-09 20:10:42 -07:00
	if (icsk->icsk_retransmits == 0) {
2008-07-03 01:05:41 -07:00
		int mib_idx;
2009-02-28 04:44:34 +00:00
		if (icsk->icsk_ca_state == TCP_CA_Disorder) {
			if (tcp_is_sack(tp))
				mib_idx = LINUX_MIB_TCPSACKFAILURES;
			else
				mib_idx = LINUX_MIB_TCPRENOFAILURES;
		} else if (icsk->icsk_ca_state == TCP_CA_Recovery) {
			if (tcp_is_sack(tp))
				mib_idx = LINUX_MIB_TCPSACKRECOVERYFAIL;
			else
				mib_idx = LINUX_MIB_TCPRENORECOVERYFAIL;
2005-08-10 04:03:31 -03:00
		} else if (icsk->icsk_ca_state == TCP_CA_Loss) {
2008-07-03 01:05:41 -07:00
mib_idx = LINUX_MIB_TCPLOSSFAILURES ;
2005-04-16 15:20:36 -07:00
} else {
2008-07-03 01:05:41 -07:00
mib_idx = LINUX_MIB_TCPTIMEOUTS ;
2005-04-16 15:20:36 -07:00
}
2008-07-16 20:31:16 -07:00
NET_INC_STATS_BH ( sock_net ( sk ) , mib_idx ) ;
2005-04-16 15:20:36 -07:00
}
if ( tcp_use_frto ( sk ) ) {
tcp_enter_frto ( sk ) ;
} else {
tcp_enter_loss ( sk , 0 ) ;
}
2007-03-07 12:12:44 -08:00
	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
2005-04-16 15:20:36 -07:00
		/* Retransmission failed because of local congestion,
		 * do not backoff.
		 */
2005-08-09 20:10:42 -07:00
		if (!icsk->icsk_retransmits)
			icsk->icsk_retransmits = 1;
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
2005-08-09 20:11:08 -07:00
					  min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
					  TCP_RTO_MAX);
2005-04-16 15:20:36 -07:00
goto out ;
}
	/* Increase the timeout each time we retransmit.  Note that
	 * we do not increase the rtt estimate.  rto is initialized
	 * from rtt, but increases here.  Jacobson (SIGCOMM 88) suggests
	 * that doubling rto each time is the least we can get away with.
	 * In KA9Q, Karn uses this for the first few times, and then
	 * goes to quadratic.  netBSD doubles, but only goes up to *64,
	 * and clamps at 1 to 64 sec afterwards.  Note that 120 sec is
	 * defined in the protocol as the maximum possible RTT.  I guess
	 * we'll have to use something other than TCP to talk to the
	 * University of Mars.
	 *
	 * PAWS allows us longer timeouts and large windows, so once
	 * implemented ftp to mars will work nicely. We will have to fix
	 * the 120 second clamps though!
	 */
2005-08-09 20:10:42 -07:00
	icsk->icsk_backoff++;
	icsk->icsk_retransmits++;
2005-04-16 15:20:36 -07:00
out_reset_timer :
2010-02-18 02:47:01 +00:00
	/* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
	 * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
	 * might be increased if the stream oscillates between thin and thick,
	 * thus the old value might already be too high compared to the value
	 * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
	 * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
	 * exponential backoff behaviour to avoid continuing to hammer
	 * linear-timeout retransmissions into a black hole.
	 */
	if (sk->sk_state == TCP_ESTABLISHED &&
	    (tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
	    tcp_stream_is_thin(tp) &&
	    icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
		icsk->icsk_backoff = 0;
		icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
	} else {
		/* Use normal (exponential) backoff */
		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
	}
2005-08-09 20:11:08 -07:00
inet_csk_reset_xmit_timer ( sk , ICSK_TIME_RETRANS , icsk - > icsk_rto , TCP_RTO_MAX ) ;
Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
which may represent a number of allowed retransmissions or a timeout value.
Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
in number of allowed retransmissions.
For any desired threshold R2 (by means of time) one can specify tcp_retries2
(by means of number of retransmissions) such that TCP will not time out
earlier than R2. This is the case, because the RTO schedule follows a fixed
pattern, namely exponential backoff.
However, the RTO behaviour is not predictable any more if RTO backoffs can be
reverted, as it is the case in the draft
"Make TCP more Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
In the worst case TCP would time out a connection after 3.2 seconds, if the
initial RTO equaled MIN_RTO and each backoff has been reverted.
This patch introduces a function retransmits_timed_out(N),
which calculates the timeout of a TCP connection, assuming an initial
RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
Whenever timeout decisions are made by comparing the retransmission counter
to some value N, this function can be used, instead.
The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
can occur than the value indicates. However, it yields a timeout which is
similar to the one of an unpatched, exponentially backing off TCP in the same
scenario. As no application could rely on an RTO greater than MIN_RTO, there
should be no risk of a regression.
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
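The arithmetic behind the commit message above can be checked with a small standalone sketch. It is not the kernel's retransmits_timed_out(); the function names are hypothetical, and the constants (a 200 ms minimum RTO, a 120 s maximum RTO) are assumptions chosen to match the figures the message quotes. With N = 15, the usual tcp_retries2 default (also an assumption here), fully reverted backoffs give 16 x 200 ms = 3.2 s, while an undisturbed exponential schedule gives a far longer budget.

```c
#include <stdint.h>

/* Assumed values in milliseconds; not taken from the commit message. */
#define BUDGET_RTO_MIN_MS 200u
#define BUDGET_RTO_MAX_MS 120000u

/*
 * Time covered by the first timeout plus n exponentially backed-off
 * retransmission timeouts, each one clamped to the assumed maximum.
 */
static uint64_t sketch_backoff_budget_ms(unsigned int n)
{
	uint64_t total = 0;
	uint64_t rto = BUDGET_RTO_MIN_MS;
	unsigned int i;

	for (i = 0; i <= n; i++) {
		total += rto;
		rto *= 2;
		if (rto > BUDGET_RTO_MAX_MS)
			rto = BUDGET_RTO_MAX_MS;
	}
	return total;
}

/*
 * Same number of timer expirations, but with every backoff reverted,
 * so each interval stays at the assumed minimum RTO.
 */
static uint64_t sketch_reverted_budget_ms(unsigned int n)
{
	return (uint64_t)(n + 1) * BUDGET_RTO_MIN_MS;
}
```

Comparing the two results for the same n shows why a pure retransmission counter stops being a usable timeout once backoffs can be reverted, which is the motivation the message gives for switching to a time-based threshold.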
2009-08-26 00:16:34 +00:00
	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
2005-04-16 15:20:36 -07:00
		__sk_dst_reset(sk);
out:;
}
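The choice between linear and exponential timeouts at the end of the function above can be modelled outside the kernel. The following is a minimal sketch under stated assumptions: the struct, the function name, the millisecond units, and the two constants are all hypothetical stand-ins (the retry limit of 6 and the 120 s ceiling are assumed values), and `base_rto_ms` stands in for whatever RTO the RTT estimator would produce without backoff.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_RTO_MAX_MS          120000u /* assumed ceiling */
#define MODEL_THIN_LINEAR_RETRIES 6u      /* assumed limit   */

struct sketch_rto_state {
	uint32_t rto_ms;          /* timeout currently in force           */
	uint32_t base_rto_ms;     /* RTO derived from RTT, no backoff     */
	unsigned int backoff;     /* number of doublings applied          */
	unsigned int retransmits; /* retransmissions since the last ACK   */
};

/* Advance the model by one retransmission timeout. */
static void sketch_next_rto(struct sketch_rto_state *s, bool thin_linear)
{
	s->backoff++;
	s->retransmits++;

	if (thin_linear && s->retransmits <= MODEL_THIN_LINEAR_RETRIES) {
		/* Linear phase: drop the backoff and reuse the base RTO. */
		s->backoff = 0;
		s->rto_ms = s->base_rto_ms < MODEL_RTO_MAX_MS ?
			    s->base_rto_ms : MODEL_RTO_MAX_MS;
	} else {
		/* Exponential phase: double, then clamp to the ceiling. */
		uint64_t doubled = (uint64_t)s->rto_ms * 2;

		s->rto_ms = doubled < MODEL_RTO_MAX_MS ?
			    (uint32_t)doubled : MODEL_RTO_MAX_MS;
	}
}
```

In this model a thin stream keeps its base timeout for the first six retransmissions and only then starts doubling, whereas a normal stream doubles from the first retransmission onward.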
static void tcp_write_timer(unsigned long data)
{
2008-11-03 02:47:38 -08:00
	struct sock *sk = (struct sock *)data;
2005-08-09 20:10:42 -07:00
	struct inet_connection_sock *icsk = inet_csk(sk);
2005-04-16 15:20:36 -07:00
	int event;
	bh_lock_sock(sk);
	if (sock_owned_by_user(sk)) {
		/* Try again later */
2005-08-09 20:10:42 -07:00
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, jiffies + (HZ / 20));
2005-04-16 15:20:36 -07:00
		goto out_unlock;
	}
2005-08-09 20:10:42 -07:00
	if (sk->sk_state == TCP_CLOSE || !icsk->icsk_pending)
2005-04-16 15:20:36 -07:00
		goto out;
2005-08-09 20:10:42 -07:00
	if (time_after(icsk->icsk_timeout, jiffies)) {
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
2005-04-16 15:20:36 -07:00
		goto out;
	}
2005-08-09 20:10:42 -07:00
	event = icsk->icsk_pending;
	icsk->icsk_pending = 0;
2005-04-16 15:20:36 -07:00
	switch (event) {
2005-08-09 20:10:42 -07:00
	case ICSK_TIME_RETRANS:
2005-04-16 15:20:36 -07:00
		tcp_retransmit_timer(sk);
		break;
2005-08-09 20:10:42 -07:00
	case ICSK_TIME_PROBE0:
2005-04-16 15:20:36 -07:00
		tcp_probe_timer(sk);
		break;
	}
	TCP_CHECK_TIMER(sk);
out:
2007-12-31 00:11:19 -08:00
	sk_mem_reclaim(sk);
2005-04-16 15:20:36 -07:00
out_unlock:
	bh_unlock_sock(sk);
	sock_put(sk);
}
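The write timer above re-arms itself when the stored deadline is still ahead of the current tick count. Tick counters wrap around, so such a test cannot be a plain `>` comparison. The helper below is a sketch of a wraparound-safe comparison in that spirit; its name is hypothetical, and it works on 32-bit values for portability rather than on the `unsigned long` type the kernel uses for jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when tick value a is later than tick value b, provided the two
 * are less than half the counter range apart. The unsigned subtraction
 * wraps modulo 2^32; reinterpreting the difference as signed yields the
 * direction. (The unsigned-to-signed conversion is implementation-defined
 * in ISO C, but behaves as two's complement on common compilers.)
 */
static bool sketch_time_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}
```

For example, a deadline of 5 counts as later than a current tick of 0xFFFFFFF0, because the counter is assumed to have wrapped between the two readings.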
2005-08-09 20:11:56 -07:00
/*
 *	Timer for listening sockets
 */
static void tcp_synack_timer(struct sock *sk)
{
2005-08-09 20:15:09 -07:00
	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
2005-04-16 15:20:36 -07:00
}
2010-01-17 19:09:39 -08:00
void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req)
{
	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEOUTS);
}
EXPORT_SYMBOL(tcp_syn_ack_timeout);
2005-04-16 15:20:36 -07:00
void tcp_set_keepalive(struct sock *sk, int val)
{
	if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
		return;
	if (val && !sock_flag(sk, SOCK_KEEPOPEN))
2005-08-09 20:10:42 -07:00
		inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tcp_sk(sk)));
2005-04-16 15:20:36 -07:00
	else if (!val)
2005-08-09 20:10:42 -07:00
		inet_csk_delete_keepalive_timer(sk);
2005-04-16 15:20:36 -07:00
}
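The first test in tcp_set_keepalive() checks the socket state against several states at once by shifting 1 by the state number and masking. A minimal sketch of that idiom follows; the enum, the macro, and the helper are hypothetical names, and the numeric state values are illustrative assumptions rather than something stated in this file.

```c
#include <stdbool.h>

/* Illustrative state numbers; assumed for the sake of the example. */
enum sketch_state {
	SKETCH_ESTABLISHED = 1,
	SKETCH_CLOSE       = 7,
	SKETCH_LISTEN      = 10
};

/* One flag bit per state, so a set of states fits in one word. */
#define SKETCH_FLAG(state) (1u << (state))

/* True when state is a member of the set encoded in mask. */
static bool sketch_state_in(enum sketch_state state, unsigned int mask)
{
	return (SKETCH_FLAG(state) & mask) != 0;
}
```

A call such as `sketch_state_in(st, SKETCH_FLAG(SKETCH_CLOSE) | SKETCH_FLAG(SKETCH_LISTEN))` then replaces a chain of equality comparisons with a single mask test.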
static void tcp_keepalive_timer(unsigned long data)
{
	struct sock *sk = (struct sock *)data;
2005-08-10 04:03:31 -03:00
	struct inet_connection_sock *icsk = inet_csk(sk);
2005-04-16 15:20:36 -07:00
	struct tcp_sock *tp = tcp_sk(sk);
2010-04-26 18:33:27 +00:00
	u32 elapsed;
2005-04-16 15:20:36 -07:00
	/* Only process if socket is not in use. */
	bh_lock_sock(sk);
	if (sock_owned_by_user(sk)) {
2007-02-09 23:24:47 +09:00
		/* Try again later. */
2005-08-09 20:10:42 -07:00
		inet_csk_reset_keepalive_timer(sk, HZ / 20);
2005-04-16 15:20:36 -07:00
		goto out;
	}
	if (sk->sk_state == TCP_LISTEN) {
		tcp_synack_timer(sk);
		goto out;
	}
	if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
		if (tp->linger2 >= 0) {
2005-08-09 20:10:42 -07:00
			const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
2005-04-16 15:20:36 -07:00
			if (tmo > 0) {
				tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
				goto out;
			}
		}
		tcp_send_active_reset(sk, GFP_ATOMIC);
		goto death;
	}
	if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
		goto out;
	elapsed = keepalive_time_when(tp);
	/* It is alive without keepalive 8) */
2007-03-07 12:12:44 -08:00
	if (tp->packets_out || tcp_send_head(sk))
2005-04-16 15:20:36 -07:00
		goto resched;
2010-04-26 18:33:27 +00:00
	elapsed = keepalive_time_elapsed(tp);
2005-04-16 15:20:36 -07:00
	if (elapsed >= keepalive_time_when(tp)) {
2009-08-28 23:48:54 -07:00
		if (icsk->icsk_probes_out >= keepalive_probes(tp)) {
2005-04-16 15:20:36 -07:00
			tcp_send_active_reset(sk, GFP_ATOMIC);
			tcp_write_err(sk);
			goto out;
		}
		if (tcp_write_wakeup(sk) <= 0) {
2005-08-10 04:03:31 -03:00
			icsk->icsk_probes_out++;
2005-04-16 15:20:36 -07:00
			elapsed = keepalive_intvl_when(tp);
		} else {
			/* If keepalive was lost due to local congestion,
			 * try harder.
			 */
			elapsed = TCP_RESOURCE_PROBE_INTERVAL;
		}
	} else {
		/* It is tp->rcv_tstamp + keepalive_time_when(tp) */
		elapsed = keepalive_time_when(tp) - elapsed;
	}
	TCP_CHECK_TIMER(sk);
2007-12-31 00:11:19 -08:00
	sk_mem_reclaim(sk);
2005-04-16 15:20:36 -07:00
resched:
2005-08-09 20:10:42 -07:00
	inet_csk_reset_keepalive_timer(sk, elapsed);
2005-04-16 15:20:36 -07:00
	goto out;
2007-02-09 23:24:47 +09:00
death:
	tcp_done(sk);
out:
	bh_unlock_sock(sk);
	sock_put(sk);
}
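The rescheduling decision at the heart of the keepalive timer above can be reduced to a small pure function. The sketch below is a simplified model, not the kernel logic: all names are hypothetical, times are in milliseconds, and it deliberately leaves out the FIN_WAIT2 handling, the "data still in flight" shortcut, and the shorter retry used when a probe is lost to local congestion.

```c
#include <stdint.h>

struct sketch_keepalive {
	uint32_t idle_ms;        /* time since the peer was last heard from */
	uint32_t time_ms;        /* idle time required before probing       */
	uint32_t intvl_ms;       /* gap between successive probes           */
	unsigned int probes_out; /* probes sent without an answer           */
	unsigned int max_probes; /* give up after this many                 */
};

enum sketch_action { SKETCH_WAIT, SKETCH_PROBE, SKETCH_ABORT };

/*
 * Decide what one timer expiry should do and how long to wait before
 * the next one. *delay_ms is 0 when the connection is aborted.
 */
static enum sketch_action
sketch_keepalive_step(struct sketch_keepalive *k, uint32_t *delay_ms)
{
	if (k->idle_ms < k->time_ms) {
		/* Not idle long enough: sleep for the remainder. */
		*delay_ms = k->time_ms - k->idle_ms;
		return SKETCH_WAIT;
	}
	if (k->probes_out >= k->max_probes) {
		/* Every probe went unanswered: give up. */
		*delay_ms = 0;
		return SKETCH_ABORT;
	}
	/* Send one more probe and come back after the probe interval. */
	k->probes_out++;
	*delay_ms = k->intvl_ms;
	return SKETCH_PROBE;
}
```

Keeping the decision separate from the locking and timer calls makes the three outcomes (wait out the remaining idle time, probe, abort) easy to exercise in isolation.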