492135557d
This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested by the RFC, on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

Without the fallback, the connection attempt would get stuck at the
middlebox drop point:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, only a small percentage of sites would see such additional
setup latency: it was found at the end of 2014 that ~56% of IPv4 and 65%
of IPv6 servers of the Alexa 1 million list would negotiate ECN (aka
tcp_ecn=2 default), and 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the RFC3168,
section 6.1.1.1. fallback on timeout. For users who explicitly do not want
this, which can be the case in DC deployments, we add a
net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.

tp->ecn_flags are not cleared in tcp_ecn_clear_syn() on output; instead,
we let tcp_ecn_rcv_synack() take that over on the input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not
prevent ECN from being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
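
The retry behaviour described in the commit message can be modelled
roughly as follows. This is only an illustrative sketch, not kernel code:
the function syn_flags and its three arguments are hypothetical names
that mirror the tcp_ecn and tcp_ecn_fallback sysctls, and per-route ECN
settings are ignored.

```shell
#!/bin/sh
# Hypothetical model (not kernel code): print the TCP flags a client SYN
# would carry on a given connection attempt.
#   $1 = tcp_ecn setting (the client requests ECN only when this is 1)
#   $2 = tcp_ecn_fallback setting (1 = fallback enabled, 0 = disabled)
#   $3 = attempt number (1 = initial SYN, 2 and up = retransmissions)
syn_flags() {
    ecn=$1 fallback=$2 attempt=$3
    if [ "$ecn" -ne 1 ]; then
        # tcp_ecn=0 or tcp_ecn=2: no ECN-setup SYN is sent at all.
        echo "SYN"
    elif [ "$fallback" -eq 1 ] && [ "$attempt" -gt 1 ]; then
        # Retransmission with fallback enabled: CWR and ECE are cleared.
        echo "SYN"
    else
        # Initial SYN, or retransmission with the fallback disabled.
        echo "SYN ECE CWR"
    fi
}

for n in 1 2 3; do
    echo "attempt $n: fallback=1 -> $(syn_flags 1 1 "$n")," \
         "fallback=0 -> $(syn_flags 1 0 "$n")"
done
```

With the fallback enabled, only the first attempt carries ECE and CWR;
with it disabled, every retransmission repeats the ECN-setup SYN, which
matches the two drop scenarios sketched in the message.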
45 lines
1.6 KiB
Plaintext
DCTCP (DataCenter TCP)
----------------------

DCTCP is an enhancement to the TCP congestion control algorithm for data
center networks and leverages Explicit Congestion Notification (ECN) in
the data center network to provide multi-bit feedback to the end hosts.

To enable it on end hosts:

  sysctl -w net.ipv4.tcp_congestion_control=dctcp
  sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional)
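
Before switching the congestion control, it can help to check that dctcp
is actually offered by the running kernel. The following is a minimal
sketch that only reads state and changes nothing; the helper cc_listed is
a hypothetical name introduced here, and the check is skipped on systems
where the proc file is not readable.

```shell
#!/bin/sh
# cc_listed LIST NAME: succeed if NAME appears as a whole word in the
# space-separated LIST (hypothetical helper, not part of any tool).
cc_listed() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# The kernel lists the usable congestion control algorithms in this file.
avail=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$avail" ]; then
    if cc_listed "$(cat "$avail")" dctcp; then
        echo "dctcp is available"
    else
        echo "dctcp is not listed; loading the tcp_dctcp module may be needed"
    fi
fi
```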

All switches in the data center network running DCTCP must support ECN
marking and be configured for marking when reaching defined switch buffer
thresholds. The default ECN marking threshold heuristic for DCTCP on
switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps,
but might need further careful tweaking.
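
The packet and byte figures above are consistent with each other if one
assumes 1500-byte packets. That packet size is an assumption made here
for illustration; the text does not state it. A small sketch of the
conversion, with threshold_bytes as a hypothetical helper name:

```shell
#!/bin/sh
# Assumption: 1500-byte packets (not stated in the text above).
PKT_BYTES=1500

# threshold_bytes PACKETS: convert a marking threshold from packets to bytes.
threshold_bytes() {
    echo $(( $1 * PKT_BYTES ))
}

echo "1Gbps:  20 packets = $(threshold_bytes 20) bytes"
echo "10Gbps: 65 packets = $(threshold_bytes 65) bytes"
```

Under that assumption, 20 packets come to 30000 bytes (30KB) and 65
packets to 97500 bytes (~100KB).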

For more details, see the documents below:

Paper:

The algorithm is further described in detail in the following two
SIGCOMM/SIGMETRICS papers:

 i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
    Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan:
      "Data Center TCP (DCTCP)", Data Center Networks session
      Proc. ACM SIGCOMM, New Delhi, 2010.
    http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
    http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192

ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar:
      "Analysis of DCTCP: Stability, Convergence, and Fairness"
      Proc. ACM SIGMETRICS, San Jose, 2011.
    http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf

IETF informational draft:

  http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00

DCTCP site:

  http://simula.stanford.edu/~alizade/Site/DCTCP.html