2019-05-27 09:55:01 +03:00
// SPDX-License-Identifier: GPL-2.0-or-later
2005-04-17 02:20:36 +04:00
/*
* INET An implementation of the TCP / IP protocol suite for the LINUX
* operating system . INET is implemented using the BSD Socket
* interface as the means of communication with the user level .
*
* Implementation of the Transmission Control Protocol ( TCP ) .
*
2005-05-06 03:16:16 +04:00
* Authors : Ross Biro
2005-04-17 02:20:36 +04:00
* Fred N . van Kempen , < waltje @ uWalt . NL . Mugnet . ORG >
* Mark Evans , < evansmp @ uhura . aston . ac . uk >
* Corey Minyard < wf - rch ! minyard @ relay . EU . net >
* Florian La Roche , < flla @ stud . uni - sb . de >
* Charles Hedrick , < hedrick @ klinzhai . rutgers . edu >
* Linus Torvalds , < torvalds @ cs . helsinki . fi >
* Alan Cox , < gw4pts @ gw4pts . ampr . org >
* Matthew Dillon , < dillon @ apollo . west . oic . com >
* Arnt Gulbrandsen , < agulbra @ nvg . unit . no >
* Jorge Cwik , < jorge @ laser . satlink . net >
*
* Fixes :
* Alan Cox : Numerous verify_area ( ) calls
* Alan Cox : Set the ACK bit on a reset
* Alan Cox : Stopped it crashing if it closed while
* sk - > inuse = 1 and was trying to connect
* ( tcp_err ( ) ) .
* Alan Cox : All icmp error handling was broken
* pointers passed were wrong and the
* socket was looked up backwards . Nobody
* tested any icmp error code obviously .
* Alan Cox : tcp_err ( ) now handled properly . It
* wakes people on errors . poll
* behaves and the icmp error race
* has gone by moving it into sock . c
* Alan Cox : tcp_send_reset ( ) fixed to work for
* everything not just packets for
* unknown sockets .
* Alan Cox : tcp option processing .
* Alan Cox : Reset tweaked ( still not 100 % ) [ Had
* syn rule wrong ]
* Herp Rosmanith : More reset fixes
* Alan Cox : No longer acks invalid rst frames .
* Acking any kind of RST is right out .
* Alan Cox : Sets an ignore me flag on an rst
* receive otherwise odd bits of prattle
* escape still
* Alan Cox : Fixed another acking RST frame bug .
* Should stop LAN workplace lockups .
* Alan Cox : Some tidyups using the new skb list
* facilities
* Alan Cox : sk - > keepopen now seems to work
* Alan Cox : Pulls options out correctly on accepts
* Alan Cox : Fixed assorted sk - > rqueue - > next errors
* Alan Cox : PSH doesn ' t end a TCP read . Switched a
* bit to skb ops .
* Alan Cox : Tidied tcp_data to avoid a potential
* nasty .
* Alan Cox : Added some better commenting , as the
* tcp is hard to follow
* Alan Cox : Removed incorrect check for 20 * psh
* Michael O ' Reilly : ack < copied bug fix .
* Johannes Stille : Misc tcp fixes ( not all in yet ) .
* Alan Cox : FIN with no memory - > CRASH
* Alan Cox : Added socket option proto entries .
* Also added awareness of them to accept .
* Alan Cox : Added TCP options ( SOL_TCP )
* Alan Cox : Switched wakeup calls to callbacks ,
* so the kernel can layer network
* sockets .
* Alan Cox : Use ip_tos / ip_ttl settings .
* Alan Cox : Handle FIN ( more ) properly ( we hope ) .
* Alan Cox : RST frames sent on unsynchronised
* state ack error .
* Alan Cox : Put in missing check for SYN bit .
* Alan Cox : Added tcp_select_window ( ) aka NET2E
* window non shrink trick .
* Alan Cox : Added a couple of small NET2E timer
* fixes
* Charles Hedrick : TCP fixes
* Toomas Tamm : TCP window fixes
* Alan Cox : Small URG fix to rlogin ^ C ack fight
* Charles Hedrick : Rewrote most of it to actually work
* Linus : Rewrote tcp_read ( ) and URG handling
* completely
* Gerhard Koerting : Fixed some missing timer handling
* Matthew Dillon : Reworked TCP machine states as per RFC
* Gerhard Koerting : PC / TCP workarounds
* Adam Caldwell : Assorted timer / timing errors
* Matthew Dillon : Fixed another RST bug
* Alan Cox : Move to kernel side addressing changes .
* Alan Cox : Beginning work on TCP fastpathing
* ( not yet usable )
* Arnt Gulbrandsen : Turbocharged tcp_check ( ) routine .
* Alan Cox : TCP fast path debugging
* Alan Cox : Window clamping
* Michael Riepe : Bug in tcp_check ( )
* Matt Dillon : More TCP improvements and RST bug fixes
* Matt Dillon : Yet more small nasties removed from the
* TCP code ( Be very nice to this man if
* tcp finally works 100 % ) 8 )
* Alan Cox : BSD accept semantics .
* Alan Cox : Reset on closedown bug .
* Peter De Schrijver : ENOTCONN check missing in tcp_sendto ( ) .
* Michael Pall : Handle poll ( ) after URG properly in
* all cases .
* Michael Pall : Undo the last fix in tcp_read_urg ( )
* ( multi URG PUSH broke rlogin ) .
* Michael Pall : Fix the multi URG PUSH problem in
* tcp_readable ( ) , poll ( ) after URG
* works now .
* Michael Pall : recv ( . . . , MSG_OOB ) never blocks in the
* BSD api .
* Alan Cox : Changed the semantics of sk - > socket to
* fix a race and a signal problem with
* accept ( ) and async I / O .
* Alan Cox : Relaxed the rules on tcp_sendto ( ) .
* Yury Shevchuk : Really fixed accept ( ) blocking problem .
* Craig I . Hagan : Allow for BSD compatible TIME_WAIT for
* clients / servers which listen in on
* fixed ports .
* Alan Cox : Cleaned the above up and shrank it to
* a sensible code size .
* Alan Cox : Self connect lockup fix .
* Alan Cox : No connect to multicast .
* Ross Biro : Close unaccepted children on master
* socket close .
* Alan Cox : Reset tracing code .
* Alan Cox : Spurious resets on shutdown .
* Alan Cox : Giant 15 minute / 60 second timer error
* Alan Cox : Small whoops in polling before an
* accept .
* Alan Cox : Kept the state trace facility since
* it ' s handy for debugging .
* Alan Cox : More reset handler fixes .
* Alan Cox : Started rewriting the code based on
* the RFC ' s for other useful protocol
* references see : Comer , KA9Q NOS , and
* for a reference on the difference
* between specifications and how BSD
* works see the 4.4 lite source .
* A . N . Kuznetsov : Don ' t time wait on completion of tidy
* close .
* Linus Torvalds : Fin / Shutdown & copied_seq changes .
* Linus Torvalds : Fixed BSD port reuse to work first syn
* Alan Cox : Reimplemented timers as per the RFC
* and using multiple timers for sanity .
* Alan Cox : Small bug fixes , and a lot of new
* comments .
* Alan Cox : Fixed dual reader crash by locking
* the buffers ( much like datagram . c )
* Alan Cox : Fixed stuck sockets in probe . A probe
* now gets fed up of retrying without
* ( even a no space ) answer .
* Alan Cox : Extracted closing code better
* Alan Cox : Fixed the closing state machine to
* resemble the RFC .
* Alan Cox : More ' per spec ' fixes .
* Jorge Cwik : Even faster checksumming .
* Alan Cox : tcp_data ( ) doesn ' t ack illegal PSH
* only frames . At least one pc tcp stack
* generates them .
* Alan Cox : Cache last socket .
* Alan Cox : Per route irtt .
* Matt Day : poll ( ) - > select ( ) match BSD precisely on error
* Alan Cox : New buffers
* Marc Tamsky : Various sk - > prot - > retransmits and
* sk - > retransmits misupdating fixed .
* Fixed tcp_write_timeout : stuck close ,
* and TCP syn retries gets used now .
* Mark Yarvis : In tcp_read_wakeup ( ) , don ' t send an
* ack if state is TCP_CLOSED .
* Alan Cox : Look up device on a retransmit - routes may
* change . Doesn ' t yet cope with MSS shrink right
* but it ' s a start !
* Marc Tamsky : Closing in closing fixes .
* Mike Shaver : RFC1122 verifications .
* Alan Cox : rcv_saddr errors .
* Alan Cox : Block double connect ( ) .
* Alan Cox : Small hooks for enSKIP .
* Alexey Kuznetsov : Path MTU discovery .
* Alan Cox : Support soft errors .
* Alan Cox : Fix MTU discovery pathological case
* when the remote claims no mtu !
* Marc Tamsky : TCP_CLOSE fix .
* Colin ( G3TNE ) : Send a reset on syn ack replies in
* window but wrong ( fixes NT lpd problems )
* Pedro Roque : Better TCP window handling , delayed ack .
* Joerg Reuter : No modification of locked buffers in
* tcp_do_retransmit ( )
* Eric Schenk : Changed receiver side silly window
* avoidance algorithm to BSD style
* algorithm . This doubles throughput
* against machines running Solaris ,
* and seems to result in general
* improvement .
* Stefan Magdalinski : adjusted tcp_readable ( ) to fix FIONREAD
* Willy Konynenberg : Transparent proxying support .
* Mike McLagan : Routing by source
* Keith Owens : Do proper merging with partial SKB ' s in
* tcp_do_sendmsg to avoid burstiness .
* Eric Schenk : Fix fast close down bug with
* shutdown ( ) followed by close ( ) .
* Andi Kleen : Make poll agree with SIGIO
* Salvatore Sanfilippo : Support SO_LINGER with linger = = 1 and
* lingertime = = 0 ( RFC 793 ABORT Call )
* Hirokazu Takahashi : Use copy_from_user ( ) instead of
* csum_and_copy_from_user ( ) if possible .
*
* Description of States :
*
* TCP_SYN_SENT sent a connection request , waiting for ack
*
* TCP_SYN_RECV received a connection request , sent ack ,
* waiting for final ack in three - way handshake .
*
* TCP_ESTABLISHED connection established
*
* TCP_FIN_WAIT1 our side has shutdown , waiting to complete
* transmission of remaining buffered data
*
* TCP_FIN_WAIT2 all buffered data sent , waiting for remote
* to shutdown
*
* TCP_CLOSING both sides have shutdown but we still have
* data we have to finish sending
*
* TCP_TIME_WAIT timeout to catch resent junk before entering
* closed , can only be entered from FIN_WAIT2
* or CLOSING . Required because the other end
* may not have gotten our last ACK causing it
* to retransmit the data packet ( which we ignore )
*
* TCP_CLOSE_WAIT remote side has shutdown and is waiting for
* us to finish writing our data and to shutdown
* ( we have to close ( ) to move on to LAST_ACK )
*
* TCP_LAST_ACK our side has shutdown after remote has
* shutdown . There may still be data in our
* buffer that we have to finish sending
*
* TCP_CLOSE socket is finished
*/
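The state descriptions above can be read as a small state machine. The following is a minimal sketch of where a local close moves each state. The `ST_*` names and the `on_local_close()` helper are hypothetical and not part of this file. The SYN_SENT and SYN_RECV cases follow the usual TCP state diagram rather than the comment itself, and the remaining states are simplified to "stay where you are".

```c
/* Hypothetical state names for illustration; not the kernel's enum. */
enum st {
	ST_ESTABLISHED, ST_SYN_SENT, ST_SYN_RECV, ST_FIN_WAIT1, ST_FIN_WAIT2,
	ST_TIME_WAIT, ST_CLOSE, ST_CLOSE_WAIT, ST_LAST_ACK, ST_CLOSING
};

/* State entered when the local side closes (simplified sketch). */
static enum st on_local_close(enum st s)
{
	switch (s) {
	case ST_ESTABLISHED:
	case ST_SYN_RECV:
		return ST_FIN_WAIT1;	/* our side shuts down first */
	case ST_CLOSE_WAIT:
		return ST_LAST_ACK;	/* remote has already shut down */
	case ST_SYN_SENT:
		return ST_CLOSE;	/* nothing was established yet */
	default:
		return s;		/* close already in progress or finished */
	}
}
```

For example, `on_local_close(ST_CLOSE_WAIT)` yields `ST_LAST_ACK`, matching the note that close() is needed to leave TCP_CLOSE_WAIT.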
2012-03-12 11:03:32 +04:00
# define pr_fmt(fmt) "TCP: " fmt
2016-01-24 16:20:23 +03:00
# include <crypto/hash.h>
2007-08-29 02:50:33 +04:00
# include <linux/kernel.h>
2005-04-17 02:20:36 +04:00
# include <linux/module.h>
# include <linux/types.h>
# include <linux/fcntl.h>
# include <linux/poll.h>
2015-04-29 02:23:49 +03:00
# include <linux/inet_diag.h>
2005-04-17 02:20:36 +04:00
# include <linux/init.h>
# include <linux/fs.h>
2007-11-07 10:30:13 +03:00
# include <linux/skbuff.h>
2008-07-03 14:22:02 +04:00
# include <linux/scatterlist.h>
2007-11-07 10:30:13 +03:00
# include <linux/splice.h>
# include <linux/net.h>
# include <linux/socket.h>
2005-04-17 02:20:36 +04:00
# include <linux/random.h>
2018-10-31 01:09:49 +03:00
# include <linux/memblock.h>
2008-06-28 04:23:57 +04:00
# include <linux/highmem.h>
2006-03-25 12:36:56 +03:00
# include <linux/cache.h>
2006-06-22 14:02:40 +04:00
# include <linux/err.h>
2009-12-02 21:12:09 +03:00
# include <linux/time.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
# include <linux/slab.h>
2017-08-23 00:08:48 +03:00
# include <linux/errqueue.h>
2017-10-25 12:01:45 +03:00
# include <linux/static_key.h>
bpf: net: Emit anonymous enum with BPF_TCP_CLOSE value explicitly
The selftest failed to compile with clang-built bpf-next.
Adding LLVM=1 to your vmlinux and selftest build will use clang.
The error message is:
progs/test_sk_storage_tracing.c:38:18: error: use of undeclared identifier 'BPF_TCP_CLOSE'
if (newstate == BPF_TCP_CLOSE)
^
1 error generated.
make: *** [Makefile:423: /bpf-next/tools/testing/selftests/bpf/test_sk_storage_tracing.o] Error 1
The reason for the failure is that BPF_TCP_CLOSE, a value of
an anonymous enum defined in uapi bpf.h, is not defined in
vmlinux.h. gcc does not have this problem. Since vmlinux.h
is derived from BTF which is derived from vmlinux DWARF,
that means gcc-produced vmlinux DWARF has BPF_TCP_CLOSE
while llvm-produced vmlinux DWARF does not.
BPF_TCP_CLOSE is referenced in net/ipv4/tcp.c as
BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
The following test mimics the above BUILD_BUG_ON, preprocessed
with clang compiler, and shows gcc DWARF contains BPF_TCP_CLOSE while
llvm DWARF does not.
$ cat t.c
enum {
BPF_TCP_ESTABLISHED = 1,
BPF_TCP_CLOSE = 7,
};
enum {
TCP_ESTABLISHED = 1,
TCP_CLOSE = 7,
};
int test() {
do {
extern void __compiletime_assert_767(void) ;
if ((int)BPF_TCP_CLOSE != (int)TCP_CLOSE) __compiletime_assert_767();
} while (0);
return 0;
}
$ clang t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
$ gcc t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
DW_AT_name ("BPF_TCP_CLOSE")
Further checking of the clang code shows that clang actually tries to
evaluate the condition at compile time. If it is definitely
true/false, it performs the optimization and the whole if condition
is removed before generating IR/debuginfo.
This patch explicitly adds an expression after the
above mentioned BUILD_BUG_ON in net/ipv4/tcp.c like
(void)BPF_TCP_ESTABLISHED
to enable generation of debuginfo for the anonymous
enum which also includes BPF_TCP_CLOSE.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210317174132.589276-1-yhs@fb.com
2021-03-17 20:41:32 +03:00
# include <linux/btf.h>
2005-04-17 02:20:36 +04:00
# include <net/icmp.h>
2012-07-19 10:43:09 +04:00
# include <net/inet_common.h>
2005-04-17 02:20:36 +04:00
# include <net/tcp.h>
2020-01-22 03:56:15 +03:00
# include <net/mptcp.h>
2005-04-17 02:20:36 +04:00
# include <net/xfrm.h>
# include <net/ip.h>
2007-11-07 10:30:13 +03:00
# include <net/sock.h>
2005-04-17 02:20:36 +04:00
2016-12-24 22:46:01 +03:00
# include <linux/uaccess.h>
2005-04-17 02:20:36 +04:00
# include <asm/ioctls.h>
2013-07-10 18:13:17 +04:00
# include <net/busy_poll.h>
2005-04-17 02:20:36 +04:00
2021-01-21 03:41:47 +03:00
/* Track pending CMSGs. */
enum {
TCP_CMSG_INQ = 1 ,
TCP_CMSG_TS = 2
} ;
2021-10-14 16:41:26 +03:00
DEFINE_PER_CPU ( unsigned int , tcp_orphan_count ) ;
EXPORT_PER_CPU_SYMBOL_GPL ( tcp_orphan_count ) ;
2005-08-10 07:11:41 +04:00
2013-10-20 03:25:36 +04:00
long sysctl_tcp_mem [ 3 ] __read_mostly ;
EXPORT_SYMBOL ( sysctl_tcp_mem ) ;
2005-04-17 02:20:36 +04:00
2021-11-15 22:02:39 +03:00
atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp ; /* Current allocated memory. */
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( tcp_memory_allocated ) ;
2022-06-09 09:34:08 +03:00
DEFINE_PER_CPU ( int , tcp_memory_per_cpu_fw_alloc ) ;
EXPORT_PER_CPU_SYMBOL_GPL ( tcp_memory_per_cpu_fw_alloc ) ;
2008-11-26 08:16:35 +03:00
2017-10-25 12:01:45 +03:00
# if IS_ENABLED(CONFIG_SMC)
DEFINE_STATIC_KEY_FALSE ( tcp_have_smc ) ;
EXPORT_SYMBOL ( tcp_have_smc ) ;
# endif
2008-11-26 08:16:35 +03:00
/*
* Current number of TCP sockets .
*/
2021-11-15 22:02:39 +03:00
struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp ;
2005-04-17 02:20:36 +04:00
EXPORT_SYMBOL ( tcp_sockets_allocated ) ;
2007-11-07 10:30:13 +03:00
/*
* TCP splice context
*/
struct tcp_splice_state {
struct pipe_inode_info * pipe ;
size_t len ;
unsigned int flags ;
} ;
2005-04-17 02:20:36 +04:00
/*
* Pressure flag : try to collapse .
* Technical note : it is used by multiple contexts non atomically .
2007-12-31 11:11:19 +03:00
* All the __sk_mem_schedule ( ) is of this nature : accounting
2005-04-17 02:20:36 +04:00
* is strict , actions are advisory and have some latency .
*/
2017-06-07 23:29:12 +03:00
unsigned long tcp_memory_pressure __read_mostly ;
EXPORT_SYMBOL_GPL ( tcp_memory_pressure ) ;
2005-04-17 02:20:36 +04:00
2008-07-17 07:28:10 +04:00
void tcp_enter_memory_pressure(struct sock *sk)
2005-04-17 02:20:36 +04:00
{
2017-06-07 23:29:12 +03:00
	unsigned long val;
2019-10-10 01:10:15 +03:00
	if (READ_ONCE(tcp_memory_pressure))
2017-06-07 23:29:12 +03:00
		return;
	val = jiffies;
	if (!val)
		val--;
	if (!cmpxchg(&tcp_memory_pressure, 0, val))
2008-07-17 07:30:14 +04:00
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
2005-04-17 02:20:36 +04:00
}
2017-06-07 23:29:12 +03:00
EXPORT_SYMBOL_GPL ( tcp_enter_memory_pressure ) ;
void tcp_leave_memory_pressure(struct sock *sk)
{
	unsigned long val;
2019-10-10 01:10:15 +03:00
	if (!READ_ONCE(tcp_memory_pressure))
2017-06-07 23:29:12 +03:00
		return;
	val = xchg(&tcp_memory_pressure, 0);
	if (val)
		NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURESCHRONO,
			      jiffies_to_msecs(jiffies - val));
}
EXPORT_SYMBOL_GPL(tcp_leave_memory_pressure);
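The enter/leave pair above stores the time at which pressure began in the flag itself, so zero can mean "no pressure". The following is a minimal userspace sketch of that pattern. It assumes C11 atomics in place of the kernel's `cmpxchg()`/`xchg()` and a caller-supplied tick in place of `jiffies`. The names `pressure_since`, `enter_pressure()` and `leave_pressure()` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the flag: 0 means "no pressure",
 * any other value is the tick at which pressure began. */
static atomic_ulong pressure_since;

/* Returns true only for the caller that actually set the flag. */
static bool enter_pressure(unsigned long now)
{
	unsigned long expected = 0;

	if (atomic_load(&pressure_since))
		return false;		/* already under pressure */
	if (!now)
		now--;			/* 0 is reserved, so wrap to ULONG_MAX */
	return atomic_compare_exchange_strong(&pressure_since, &expected, now);
}

/* Clears the flag and returns how many ticks the episode lasted (0 if none). */
static unsigned long leave_pressure(unsigned long now)
{
	unsigned long since;

	if (!atomic_load(&pressure_since))
		return 0;
	since = atomic_exchange(&pressure_since, 0);
	return since ? now - since : 0;
}
```

Because only the winner of the compare-and-exchange sees `true`, a counter bumped on that result counts pressure episodes rather than callers, which is the role the MIB increment plays above.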
2005-04-17 02:20:36 +04:00
2009-10-19 14:10:40 +04:00
/* Convert seconds to retransmits based on initial and max timeout */
static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
{
	u8 res = 0;

	if (seconds > 0) {
		int period = timeout;

		res = 1;
		while (seconds > period && res < 255) {
			res++;
			timeout <<= 1;
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return res;
}

/* Convert retransmits to seconds based on initial and max timeout */
static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
{
	int period = 0;

	if (retrans > 0) {
		period = timeout;
		while (--retrans) {
			timeout <<= 1;
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return period;
}
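Both helpers walk the same exponential backoff schedule: each retransmit doubles the timeout until it hits the cap, and the periods are summed. The following is a userspace copy for experimenting with the arithmetic. It swaps `u8` for `uint8_t`, and the 1 second initial timeout and 120 second cap used in the note below are illustrative assumptions, not values taken from this section.

```c
#include <stdint.h>

/* Userspace copy of the helper above, for illustration only. */
static uint8_t secs_to_retrans(int seconds, int timeout, int rto_max)
{
	uint8_t res = 0;

	if (seconds > 0) {
		int period = timeout;	/* total time covered so far */

		res = 1;
		while (seconds > period && res < 255) {
			res++;
			timeout <<= 1;	/* exponential backoff */
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return res;
}

/* Userspace copy of the inverse helper, for illustration only. */
static int retrans_to_secs(uint8_t retrans, int timeout, int rto_max)
{
	int period = 0;

	if (retrans > 0) {
		period = timeout;
		while (--retrans) {
			timeout <<= 1;
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return period;
}
```

With an initial timeout of 1 and a cap of 120, the cumulative periods run 1, 3, 7, 15, 31, 63, 127, 247. So `secs_to_retrans(7, 1, 120)` is 3 and `retrans_to_secs(3, 1, 120)` is 7, while 8 seconds rounds up to 4 retransmits.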
2017-07-28 20:28:20 +03:00
static u64 tcp_compute_delivery_rate(const struct tcp_sock *tp)
{
	u32 rate = READ_ONCE(tp->rate_delivered);
	u32 intv = READ_ONCE(tp->rate_interval_us);
	u64 rate64 = 0;

	if (rate && intv) {
		rate64 = (u64)rate * tp->mss_cache * USEC_PER_SEC;
		do_div(rate64, intv);
	}
	return rate64;
}
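The function above turns a packet count and a sampling interval into bytes per second. The following is a minimal userspace sketch of the same arithmetic. It uses plain 64-bit division instead of `do_div()`, and the name `delivery_rate_bytes_per_sec()` and its parameter list are hypothetical.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* delivered:   packets delivered during the sample
 * interval_us: length of the sample in microseconds
 * mss:         bytes per packet
 * Returns bytes per second, or 0 when there is no usable sample. */
static uint64_t delivery_rate_bytes_per_sec(uint32_t delivered,
					    uint32_t interval_us,
					    uint32_t mss)
{
	if (!delivered || !interval_us)
		return 0;
	return (uint64_t)delivered * mss * USEC_PER_SEC / interval_us;
}
```

For instance, 10 packets of 1448 bytes delivered in 1000 microseconds works out to 14,480,000 bytes per second.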
2012-04-19 13:55:21 +04:00
/* Address-family independent initialization for a tcp_sock.
 *
 * NOTE: A lot of things set to zero explicitly by call to
 *       sk_alloc() so need not be done here.
 */
void tcp_init_sock(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering reaching the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to a RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 00:49:28 +03:00
	tp->out_of_order_queue = RB_ROOT;
2017-10-06 08:21:27 +03:00
	sk->tcp_rtx_queue = RB_ROOT;
2012-04-19 13:55:21 +04:00
	tcp_init_xmit_timers(sk);
tcp: TCP Small Queues
This introduces TSQ (TCP Small Queues)
TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 09:50:31 +04:00
	INIT_LIST_HEAD(&tp->tsq_node);
2017-10-04 22:59:58 +03:00
	INIT_LIST_HEAD(&tp->tsorted_sent_queue);
2012-04-19 13:55:21 +04:00
	icsk->icsk_rto = TCP_TIMEOUT_INIT;
2020-08-20 22:00:27 +03:00
	icsk->icsk_rto_min = TCP_RTO_MIN;
2020-08-20 22:00:21 +03:00
	icsk->icsk_delack_max = TCP_DELACK_MAX;
2014-02-27 02:02:48 +04:00
	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
2017-05-17 00:00:13 +03:00
	minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
2012-04-19 13:55:21 +04:00
	/* So many TCP implementations out there (incorrectly) count the
	 * initial SYN frame in their delayed-ACK and congestion control
	 * algorithms that we must have the following bandaid to talk
	 * efficiently to them. -DaveM
	 */
2022-04-06 02:35:38 +03:00
	tcp_snd_cwnd_set(tp, TCP_INIT_CWND);
2012-04-19 13:55:21 +04:00
2016-09-20 06:39:15 +03:00
	/* There's a bubble in the pipe until at least the first ACK. */
	tp->app_limited = ~0U;
2023-01-19 22:00:28 +03:00
	tp->rate_app_limited = 1;
2016-09-20 06:39:15 +03:00
2012-04-19 13:55:21 +04:00
	/* See draft-stevens-tcpca-spec-01 for discussion of the
	 * initialization of these values.
	 */
	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
	tp->snd_cwnd_clamp = ~0;
	tp->mss_cache = TCP_MSS_DEFAULT;
2022-07-15 20:17:49 +03:00
	tp->reordering = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_reordering);
2014-09-27 00:37:32 +04:00
	tcp_assign_congestion_control(sk);
2012-04-19 13:55:21 +04:00
2013-02-11 09:50:17 +04:00
	tp->tsoffset = 0;
2017-11-04 02:38:48 +03:00
	tp->rack.reo_wnd_steps = 1;
2013-02-11 09:50:17 +04:00
2012-04-19 13:55:21 +04:00
	sk->sk_write_space = sk_stream_write_space;
	sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
	icsk->icsk_sync_mss = tcp_sync_mss;
2022-07-22 21:22:00 +03:00
	WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_wmem[1]));
	WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[1]));
tcp: get rid of sysctl_tcp_adv_win_scale
With modern NIC drivers shifting to full page allocations per
received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to translate
a memory use into an expected payload (RWIN), in RX path.
tcp_win_from_space() implementation is limited to few cases.
For hosts dealing with various MSS, we either under estimate
or over estimate the RWIN we send to the remote peers.
For instance with the default sysctl_tcp_adv_win_scale value,
we expect to store 50% of payload per allocated chunk of memory.
For the typical use of MTU=1500 traffic, and order-0 pages allocations
by NIC drivers, we are sending too big RWIN, leading to potential
tcp collapse operations, which are extremely expensive and source
of latency spikes.
This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
uses a per socket scaling factor, so that we can precisely
adjust the RWIN based on effective skb->len/skb->truesize ratio.
This patch alone can double TCP receive performance when receivers
are too slow to drain their receive queue, or by allowing
a bigger RWIN when MSS is close to PAGE_SIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-17 18:29:17 +03:00
	tcp_scaling_ratio_init(sk);
2012-04-19 13:55:21 +04:00
2022-10-21 13:16:39 +03:00
	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
2012-04-19 13:55:21 +04:00
	sk_sockets_allocated_inc(sk);
}
EXPORT_SYMBOL(tcp_init_sock);
2017-10-06 08:21:23 +03:00
static void tcp_tx_timestamp ( struct sock * sk , u16 tsflags )
net-timestamp: TCP timestamping
TCP timestamping extends SO_TIMESTAMPING to bytestreams.
Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.
The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.
This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.
This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.
If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.
The implementation supports a single timestamp per packet. It silently
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.
Implementation details:
- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.
- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 06:11:49 +04:00
{
2017-10-06 08:21:23 +03:00
	struct sk_buff *skb = tcp_write_queue_tail(sk);
2017-01-04 19:19:34 +03:00
	if (tsflags && skb) {
2014-08-06 23:09:44 +04:00
		struct skb_shared_info *shinfo = skb_shinfo(skb);
2016-04-03 06:08:08 +03:00
		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
2016-04-03 06:08:12 +03:00
		sock_tx_timestamp(sk, tsflags, &shinfo->tx_flags);
2016-04-28 06:39:01 +03:00
		if (tsflags & SOF_TIMESTAMPING_TX_ACK)
			tcb->txstamp_ack = 1;
		if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
2014-08-06 23:09:44 +04:00
			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
	}
2014-08-05 06:11:49 +04:00
}
2021-02-13 02:22:14 +03:00
static bool tcp_stream_is_readable(struct sock *sk, int target)
2018-03-28 22:49:15 +03:00
{
2021-02-13 02:22:14 +03:00
	if (tcp_epollin_ready(sk, target))
		return true;
2021-10-08 23:33:03 +03:00
	return sk_is_readable(sk);
2018-03-28 22:49:15 +03:00
}
2005-04-17 02:20:36 +04:00
/*
2018-06-28 19:43:44 +03:00
 *	Wait for a TCP event.
 *
 *	Note that we don't need to lock the socket, as the upper poll layers
 *	take care of normal races (between the test and the event) and we don't
 *	go look at any of the socket buffers directly.
2005-04-17 02:20:36 +04:00
 */
2018-06-28 19:43:44 +03:00
__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
2005-04-17 02:20:36 +04:00
{
2018-06-28 19:43:44 +03:00
	__poll_t mask;
2005-04-17 02:20:36 +04:00
	struct sock *sk = sock->sk;
2011-10-21 13:22:42 +04:00
	const struct tcp_sock *tp = tcp_sk(sk);
2023-05-09 23:36:56 +03:00
	u8 shutdown;
2015-11-12 19:43:18 +03:00
	int state;
2005-04-17 02:20:36 +04:00
2018-10-23 14:40:39 +03:00
	sock_poll_wait(file, sock, wait);
2018-06-28 19:43:44 +03:00
2017-12-20 06:12:52 +03:00
	state = inet_sk_state_load(sk);
2015-11-12 19:43:18 +03:00
	if (state == TCP_LISTEN)
2005-08-24 08:52:58 +04:00
		return inet_csk_listen_poll(sk);
2005-04-17 02:20:36 +04:00
2018-06-28 19:43:44 +03:00
	/* Socket is not locked. We are protected from async events
	 * by poll logic and correct handling of state changes
	 * made by other threads is impossible in any case.
	 */
	mask = 0;
2005-04-17 02:20:36 +04:00
	/*
2018-02-12 01:34:03 +03:00
	 * EPOLLHUP is certainly not done right. But poll() doesn't
2005-04-17 02:20:36 +04:00
	 * have a notion of HUP in just one direction, and for a
	 * socket the read side is more interesting.
	 *
2018-02-12 01:34:03 +03:00
	 * Some poll() documentation says that EPOLLHUP is incompatible
	 * with the EPOLLOUT/POLLWR flags, so somebody should check this
2005-04-17 02:20:36 +04:00
	 * all. But careful, it tends to be safer to return too many
	 * bits than too few, and you can easily break real applications
	 * if you don't tell them that something has hung up!
	 *
	 * Check-me.
	 *
2018-02-12 01:34:03 +03:00
	 * Check number 1. EPOLLHUP is _UNMASKABLE_ event (see UNIX98 and
2005-04-17 02:20:36 +04:00
	 * our fs/select.c). It means that after we received EOF,
	 * poll always returns immediately, making impossible poll() on write()
2018-02-12 01:34:03 +03:00
	 * in state CLOSE_WAIT. One solution is evident --- to set EPOLLHUP
2005-04-17 02:20:36 +04:00
	 * if and only if shutdown has been made in both directions.
	 * Actually, it is interesting to look how Solaris and DUX
2018-02-12 01:34:03 +03:00
	 * solve this dilemma. I would prefer, if EPOLLHUP were maskable,
2005-04-17 02:20:36 +04:00
	 * then we could set it on SND_SHUTDOWN. BTW examples given
	 * in Stevens' books assume exactly this behaviour, it explains
2018-02-12 01:34:03 +03:00
	 * why EPOLLHUP is incompatible with EPOLLOUT. --ANK
2005-04-17 02:20:36 +04:00
	 *
	 * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
	 * blocking on fresh not-connected or disconnected socket. --ANK
	 */
2023-05-09 23:36:56 +03:00
	shutdown = READ_ONCE(sk->sk_shutdown);
	if (shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
2018-02-12 01:34:03 +03:00
		mask |= EPOLLHUP;
2023-05-09 23:36:56 +03:00
	if (shutdown & RCV_SHUTDOWN)
2018-02-12 01:34:03 +03:00
		mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
2005-04-17 02:20:36 +04:00
2012-08-31 16:29:12 +04:00
	/* Connected or passive Fast Open socket? */
2015-11-12 19:43:18 +03:00
	if (state != TCP_SYN_SENT &&
2019-10-11 06:17:38 +03:00
	    (state != TCP_SYN_RECV || rcu_access_pointer(tp->fastopen_rsk))) {
2008-10-06 21:43:54 +04:00
		int target = sock_rcvlowat(sk, 0, INT_MAX);
2021-11-15 22:02:43 +03:00
		u16 urg_data = READ_ONCE(tp->urg_data);
2008-10-06 21:43:54 +04:00
2021-11-15 22:02:44 +03:00
		if (unlikely(urg_data) &&
2021-11-15 22:02:43 +03:00
		    READ_ONCE(tp->urg_seq) == READ_ONCE(tp->copied_seq) &&
		    !sock_flag(sk, SOCK_URGINLINE))
2010-03-19 06:29:24 +03:00
			target++;
2008-10-06 21:43:54 +04:00
2021-02-13 02:22:14 +03:00
		if (tcp_stream_is_readable(sk, target))
2018-02-12 01:34:03 +03:00
			mask |= EPOLLIN | EPOLLRDNORM;
2005-04-17 02:20:36 +04:00
2023-05-09 23:36:56 +03:00
		if (!(shutdown & SEND_SHUTDOWN)) {
2020-09-15 00:52:09 +03:00
			if (__sk_stream_is_writeable(sk, 1)) {
2018-02-12 01:34:03 +03:00
				mask |= EPOLLOUT | EPOLLWRNORM;
2005-04-17 02:20:36 +04:00
			} else { /* send SIGIO later */
2015-11-30 07:03:10 +03:00
				sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
2005-04-17 02:20:36 +04:00
				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
				/* Race breaker. If space is freed after
				 * wspace test but before the flags are set,
2015-04-20 23:05:07 +03:00
				 * IO signal will be lost. Memory barrier
				 * pairs with the input side.
2005-04-17 02:20:36 +04:00
				 */
2015-04-20 23:05:07 +03:00
				smp_mb__after_atomic();
2020-09-15 00:52:09 +03:00
				if (__sk_stream_is_writeable(sk, 1))
2018-02-12 01:34:03 +03:00
					mask |= EPOLLOUT | EPOLLWRNORM;
2005-04-17 02:20:36 +04:00
			}
2010-08-24 20:05:48 +04:00
		} else
2018-02-12 01:34:03 +03:00
			mask |= EPOLLOUT | EPOLLWRNORM;
2005-04-17 02:20:36 +04:00
2021-11-15 22:02:43 +03:00
		if (urg_data & TCP_URG_VALID)
2018-02-12 01:34:03 +03:00
			mask |= EPOLLPRI;
2023-08-16 11:15:45 +03:00
	} else if (state == TCP_SYN_SENT &&
		   inet_test_bit(DEFER_CONNECT, sk)) {
net/tcp-fastopen: Add new API support
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences is not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created, without changing the sequence of other socket calls:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return ENOTSUPP if TFO is not supported or not enabled in the
    kernel.
  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.
  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.
  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 21:59:22 +03:00
		/* Active TCP fastopen socket with defer_connect
2018-02-12 01:34:03 +03:00
		 * Return EPOLLOUT so application can call write()
2017-01-23 21:59:22 +03:00
		 * in order for kernel to generate SYN+data
		 */
2018-02-12 01:34:03 +03:00
		mask |= EPOLLOUT | EPOLLWRNORM;
2005-04-17 02:20:36 +04:00
	}
2010-09-21 02:42:05 +04:00
	/* This barrier is coupled with smp_wmb() in tcp_reset() */
	smp_rmb();
2023-03-15 23:57:44 +03:00
	if (READ_ONCE(sk->sk_err) ||
	    !skb_queue_empty_lockless(&sk->sk_error_queue))
2018-02-12 01:34:03 +03:00
		mask |= EPOLLERR;
2010-09-21 02:42:05 +04:00
2005-04-17 02:20:36 +04:00
	return mask;
}
2018-06-28 19:43:44 +03:00
EXPORT_SYMBOL(tcp_poll);
2005-04-17 02:20:36 +04:00
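The hang-up rule discussed in the tcp_poll() comment above can be tried out in userspace. The following is only a sketch of that one decision, not the kernel function: `demo_hup_bits()` and the `DEMO_*` constants are hypothetical names, and the constant values are assumed to mirror the kernel-internal RCV_SHUTDOWN, SEND_SHUTDOWN and SHUTDOWN_MASK flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Assumed to mirror the kernel-internal shutdown flags; hypothetical names. */
#define DEMO_RCV_SHUTDOWN  1u
#define DEMO_SEND_SHUTDOWN 2u
#define DEMO_SHUTDOWN_MASK 3u

/*
 * Model of the hang-up part of tcp_poll(): EPOLLHUP is reported only when
 * both directions are shut down (or the socket is closed), while a shut
 * down read side alone is reported as readable plus EPOLLRDHUP.
 */
static uint32_t demo_hup_bits(unsigned int shutdown, bool closed)
{
	uint32_t mask = 0;

	/* Only the two shutdown bits are meaningful in this model. */
	assert((shutdown & ~DEMO_SHUTDOWN_MASK) == 0);

	if (shutdown == DEMO_SHUTDOWN_MASK || closed)
		mask |= EPOLLHUP;
	if (shutdown & DEMO_RCV_SHUTDOWN)
		mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
	return mask;
}
```

A half-closed socket (send side only) reports no hang-up bit here, so an application can keep reading from it.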
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most ioctls for net protocols operate directly on the userspace
argument (arg), usually calling get_user()/put_user() directly in the
ioctl callback. This is not flexible, because it is hard to reuse these
functions without passing userspace buffers.
Change the "struct proto" ioctls to avoid touching userspace memory and
operate on kernel buffers, i.e., every protocol's ioctl callback is
adapted to operate on kernel memory rather than on userspace (so, no
more {put,get}_user() and friends being called in the ioctl callback).
This changes the "struct proto" ioctl format in the following way:
    int (*ioctl)(struct sock *sk, int cmd,
-                unsigned long arg);
+                int *karg);
(Note that this patch does not touch the "struct proto_ops"
protocols.)
So, the "karg" argument, which is passed to the ioctl callback, is a
pointer to kernel space memory (allocated inside a function wrapper).
This buffer (karg) may contain an input argument (copied from userspace in
a prep function) and it might return a value/buffer, which is copied
back to userspace if necessary. There is no one-size-fits-all format
(that is why I am using 'may' above), but basically, there are three types
of ioctls:
1) Do not read from userspace, return a result to userspace
2) Read an input parameter from userspace, and do not return anything
   to userspace
3) Read an input from userspace, and return a buffer to userspace.
The default case (1) (where no input parameter is given, and an "int" is
returned to userspace) encompasses more than 90% of the cases, but there
are two other exceptions. Here is a list of exceptions:
* Protocol RAW:
  * cmd = SIOCGETVIFCNT:
    * input and output = struct sioc_vif_req
  * cmd = SIOCGETSGCNT
    * input and output = struct sioc_sg_req
  * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
    argument, which is struct sioc_vif_req. Then the callback populates
    the struct, which is copied back to userspace.
* Protocol RAW6:
  * cmd = SIOCGETMIFCNT_IN6
    * input and output = struct sioc_mif_req6
  * cmd = SIOCGETSGCNT_IN6
    * input and output = struct sioc_sg_req6
* Protocol PHONET:
  * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
    * input int (4 bytes)
  * Nothing is copied back to userspace.
For the exception cases, the function sock_sk_ioctl_inout() will
copy the userspace input into kernel space, and copy the result back
to userspace.
The wrapper that prepares the buffer and puts the buffer back to user is
sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the caller now
calls sk_ioctl(), which will handle all cases.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09 18:27:42 +03:00
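The split described in the commit message above, where the protocol callback sees only a kernel-side int and a wrapper does the copy to userspace, can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel's sk_ioctl(): `demo_proto_ioctl()`, `demo_sk_ioctl()` and the command value 1 are hypothetical, and memcpy() stands in for the copy to userspace.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Same shape as the new "struct proto" callback: it only sees a kernel int. */
typedef int (*demo_ioctl_fn)(void *sk, int cmd, int *karg);

/* Toy callback for the default case (1): no input, one int returned. */
static int demo_proto_ioctl(void *sk, int cmd, int *karg)
{
	(void)sk;
	if (cmd != 1)	/* hypothetical command number */
		return -EINVAL;
	*karg = 42;
	return 0;
}

/*
 * Hypothetical model of the wrapper: provide the kernel buffer, call the
 * callback, and copy the answer out only on success. The callback itself
 * never touches the "user" pointer.
 */
static int demo_sk_ioctl(demo_ioctl_fn fn, void *sk, int cmd, void *uarg)
{
	int karg = 0;
	int ret;

	assert(fn != NULL && uarg != NULL);

	ret = fn(sk, cmd, &karg);
	if (ret)
		return ret;
	memcpy(uarg, &karg, sizeof(karg));	/* stands in for put_user() */
	return 0;
}
```

Because the callback works on a plain int, it can be reused by in-kernel callers that have no userspace buffer, which is the flexibility the commit message argues for.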
int tcp_ioctl(struct sock *sk, int cmd, int *karg)
2005-04-17 02:20:36 +04:00
{
	struct tcp_sock *tp = tcp_sk(sk);
	int answ;
2012-10-22 00:06:56 +04:00
	bool slow;
2005-04-17 02:20:36 +04:00
	switch (cmd) {
	case SIOCINQ:
		if (sk->sk_state == TCP_LISTEN)
			return -EINVAL;
2012-10-22 00:06:56 +04:00
		slow = lock_sock_fast(sk);
2016-03-08 01:11:05 +03:00
		answ = tcp_inq(sk);
2012-10-22 00:06:56 +04:00
		unlock_sock_fast(sk, slow);
2005-04-17 02:20:36 +04:00
		break;
	case SIOCATMARK:
2021-11-15 22:02:43 +03:00
		answ = READ_ONCE(tp->urg_data) &&
2019-10-11 06:17:43 +03:00
		       READ_ONCE(tp->urg_seq) == READ_ONCE(tp->copied_seq);
2005-04-17 02:20:36 +04:00
		break;
	case SIOCOUTQ:
		if (sk->sk_state == TCP_LISTEN)
			return -EINVAL;
		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
			answ = 0;
		else
2019-10-11 06:17:41 +03:00
			answ = READ_ONCE(tp->write_seq) - tp->snd_una;
2005-04-17 02:20:36 +04:00
		break;
2011-03-10 01:08:09 +03:00
	case SIOCOUTQNSD:
		if (sk->sk_state == TCP_LISTEN)
			return -EINVAL;
		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
			answ = 0;
		else
2019-10-11 06:17:42 +03:00
			answ = READ_ONCE(tp->write_seq) -
			       READ_ONCE(tp->snd_nxt);
2011-03-10 01:08:09 +03:00
		break;
2005-04-17 02:20:36 +04:00
	default:
		return -ENOIOCTLCMD;
2007-04-21 04:09:22 +04:00
	}
2005-04-17 02:20:36 +04:00
2023-06-09 18:27:42 +03:00
	*karg = answ;
	return 0;
2005-04-17 02:20:36 +04:00
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL(tcp_ioctl);
2005-04-17 02:20:36 +04:00
2021-09-22 20:26:40 +03:00
void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
2005-04-17 02:20:36 +04:00
{
2011-09-27 21:25:05 +04:00
	TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
2005-04-17 02:20:36 +04:00
	tp->pushed_seq = tp->write_seq;
}
2012-05-17 03:15:34 +04:00
static inline bool forced_push(const struct tcp_sock *tp)
2005-04-17 02:20:36 +04:00
{
	return after(tp->write_seq, tp->pushed_seq + (tp->max_window >> 1));
}
2021-09-22 20:26:40 +03:00
void tcp_skb_entail(struct sock *sk, struct sk_buff *skb)
2005-04-17 02:20:36 +04:00
{
[TCP]: Sed magic converts func(sk, tp, ...) -> func(sk, ...)
This is (mostly) automated change using magic:
sed -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N'
-e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N'
-e 's|struct sock \*sk,[\n\t ]*struct tcp_sock \*tp\([^{]*\n{\n\)|
struct sock \*sk\1\tstruct tcp_sock *tp = tcp_sk(sk);\n|g'
-e 's|struct sock \*sk, struct tcp_sock \*tp|
struct sock \*sk|g' -e 's|sk, tp\([^-]\)|sk\1|g'
Fixed four unused variable (tp) warnings that were introduced.
In addition, manually added newlines after local variables and
tweaked function arguments positioning.
$ gcc --version
gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)
...
$ codiff -fV built-in.o.old built-in.o.new
net/ipv4/route.c:
rt_cache_flush | +14
1 function changed, 14 bytes added
net/ipv4/tcp.c:
tcp_setsockopt | -5
tcp_sendpage | -25
tcp_sendmsg | -16
3 functions changed, 46 bytes removed
net/ipv4/tcp_input.c:
tcp_try_undo_recovery | +3
tcp_try_undo_dsack | +2
tcp_mark_head_lost | -12
tcp_ack | -15
tcp_event_data_recv | -32
tcp_rcv_state_process | -10
tcp_rcv_established | +1
7 functions changed, 6 bytes added, 69 bytes removed, diff: -63
net/ipv4/tcp_output.c:
update_send_head | -9
tcp_transmit_skb | +19
tcp_cwnd_validate | +1
tcp_write_wakeup | -17
__tcp_push_pending_frames | -25
tcp_push_one | -8
tcp_send_fin | -4
7 functions changed, 20 bytes added, 63 bytes removed, diff: -43
built-in.o.new:
18 functions changed, 40 bytes added, 178 bytes removed, diff: -138
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-21 09:18:02 +04:00
	struct tcp_sock *tp = tcp_sk(sk);
2006-11-18 00:59:12 +03:00
	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
	tcb->seq = tcb->end_seq = tp->write_seq;
2011-09-27 21:25:05 +04:00
	tcb->tcp_flags = TCPHDR_ACK;
2014-09-23 03:29:32 +04:00
	__skb_header_release(skb);
2007-03-07 23:12:44 +03:00
	tcp_add_write_queue_tail(sk, skb);
2019-10-11 06:17:46 +03:00
	sk_wmem_queued_add(sk, skb->truesize);
2007-12-31 11:11:19 +03:00
	sk_mem_charge(sk, skb->truesize);
2005-08-23 21:13:06 +04:00
	if (tp->nonagle & TCP_NAGLE_PUSH)
2007-02-09 17:24:47 +03:00
		tp->nonagle &= ~TCP_NAGLE_PUSH;
tcp: fix slow start after idle vs TSO/GSO
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.
With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.
Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.
As Neal pointed out, we also need to perform the check
if/when receive window opens.
Tested:
Following packetdrill test demonstrates the problem
// Test of slow start after idle
`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
+0 write(4, ..., 26000) = 26000
+0 > . 1:5001(5000) ack 1
+0 > . 5001:10001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
+.100 < . 1:1(0) ack 10001 win 511
+0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0 > . 10001:20001(10000) ack 1
+0 > P. 20001:26001(6000) ack 1
+.100 < . 1:1(0) ack 26001 win 511
+0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0 > . 26001:31001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 > . 31001:36001(5000) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-21 22:30:00 +03:00
	tcp_slow_start_after_idle_check(sk);
2005-04-17 02:20:36 +04:00
}
2009-12-10 10:16:52 +03:00
static inline void tcp_mark_urg(struct tcp_sock *tp, int flags)
2005-04-17 02:20:36 +04:00
{
2008-10-08 01:43:06 +04:00
	if (flags & MSG_OOB)
2005-04-17 02:20:36 +04:00
		tp->snd_up = tp->write_seq;
}
tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detect that an skb is underused
and its flush is deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in terms of CPU usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 10:36:05 +04:00
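The deferral rule described in the auto corking message above can be written as a small predicate. This is a simplified model of that description only, not the kernel's implementation: `struct demo_push_state` and `demo_should_defer_push()` are hypothetical names, and the later refinement that always pushes the skb at the head of the write queue is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the state the decision depends on. */
struct demo_push_state {
	size_t skb_len;		/* payload bytes in the skb being pushed */
	size_t size_goal;	/* optimal skb size, a multiple of MSS bytes */
	size_t lower_queued;	/* bytes of this flow still in Qdisc/NIC TX queues */
	bool autocorking;	/* models the tcp_autocorking sysctl */
};

/*
 * Defer the push only when the feature is on, the skb is still under
 * size_goal, and at least one packet of the flow waits in a lower queue.
 * In that case TX completion will trigger the delayed push.
 */
static bool demo_should_defer_push(const struct demo_push_state *s)
{
	assert(s != NULL);

	return s->autocorking &&
	       s->skb_len < s->size_goal &&
	       s->lower_queued > 0;
}
```

With nothing held in the lower queues the predicate is false, which matches the remark that the delay might be zero for a flow with no packet in the Qdisc or NIC TX ring.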
/* If a not yet filled skb is pushed, do not send it if
tcp: autocork should not hold first packet in write queue
Willem noticed a TCP_RR regression caused by TCP autocorking
on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be
right above RTT between hosts.
We can receive an ACK for a packet still in NIC TX ring buffer or in a
softnet completion queue.
Fix this by always pushing the skb if it is at the head of write queue.
Also, as TX completion is lockless, it's safer to perform sk_wmem_alloc
test after setting TSQ_THROTTLED.
erd:~# MIB="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
erd:~# ./netperf -H remote -t TCP_RR -- -o $MIB | tail -n 1
(repeat 3 times)
Before patch :
18,1049.87,41004,39631,6295.47
17,239.52,40804,48,2912.79
18,348.40,40877,54,3573.39
After patch :
18,22.84,4606,38,16.39
17,21.56,2871,36,13.51
17,22.46,2705,37,11.83
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: f54b311142a9 ("tcp: auto corking")
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 21:58:30 +04:00
* we have data packets in Qdisc or NIC queues :
tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled).
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detect that an skb was underused
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in terms of CPU usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      35209.439626 task-clock               # 2.901 CPUs utilized
             2,294 context-switches         # 0.065 K/sec
               101 CPU-migrations           # 0.003 K/sec
             4,079 page-faults              # 0.116 K/sec
    97,923,241,298 cycles                   # 2.781 GHz [83.31%]
    51,832,908,236 stalled-cycles-frontend  # 52.93% frontend cycles idle [83.30%]
    25,697,986,603 stalled-cycles-backend   # 26.24% backend cycles idle [66.70%]
   102,225,978,536 instructions             # 1.04 insns per cycle
                                            # 0.51 stalled cycles per insn [83.38%]
    18,657,696,819 branches                 # 529.906 M/sec [83.29%]
        91,679,646 branch-misses            # 0.49% of all branches [83.40%]
      12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      40045.864494 task-clock               # 3.301 CPUs utilized
               171 context-switches         # 0.004 K/sec
                53 CPU-migrations           # 0.001 K/sec
             4,080 page-faults              # 0.102 K/sec
   111,340,458,645 cycles                   # 2.780 GHz [83.34%]
    61,778,039,277 stalled-cycles-frontend  # 55.49% frontend cycles idle [83.31%]
    29,295,522,759 stalled-cycles-backend   # 26.31% backend cycles idle [66.67%]
   108,654,349,355 instructions             # 0.98 insns per cycle
                                            # 0.57 stalled cycles per insn [83.34%]
    19,552,170,748 branches                 # 488.244 M/sec [83.34%]
       157,875,417 branch-misses            # 0.81% of all branches [83.34%]
      12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
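As a small illustration of the sysctl introduced by this commit, the user-space sketch below reads a flag file such as /proc/sys/net/ipv4/tcp_autocorking. `read_autocorking_flag()` is a hypothetical helper written for this example, not part of any kernel or libc API; it assumes the file holds a single integer and returns -1 when the file cannot be opened or parsed (for instance on a kernel that predates the feature).

```c
#include <stdio.h>

/*
 * Hypothetical helper: read the integer stored in a sysctl-style file.
 * Returns the parsed value, or -1 if the file is missing or malformed.
 */
static int read_autocorking_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

A caller would typically pass the path named in the commit message, for example `read_autocorking_flag("/proc/sys/net/ipv4/tcp_autocorking")`, and treat 1 as enabled and 0 as disabled.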
2013-12-06 10:36:05 +04:00
* Because TX completion will happen shortly, it gives a chance
* to coalesce future sendmsg() payload into this skb, without
* need for a timer, and with no latency trade off.
* As packets containing data payload have a bigger truesize
2013-12-17 21:58:30 +04:00
* than pure acks (dataless) packets, the last checks prevent
* autocorking if we only have an ACK in Qdisc/NIC queues,
* or if TX completion was delayed after we processed ACK packet.
2013-12-06 10:36:05 +04:00
*/
static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
				int size_goal)
2005-04-17 02:20:36 +04:00
{
2013-12-06 10:36:05 +04:00
	return skb->len < size_goal &&
2022-07-20 19:50:25 +03:00
	       READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_autocorking) &&
2018-05-03 06:25:13 +03:00
	       !tcp_rtx_queue_empty(sk) &&
2022-03-09 08:47:06 +03:00
	       refcount_read(&sk->sk_wmem_alloc) > skb->truesize &&
	       tcp_skb_can_collapse_to(skb);
2013-12-06 10:36:05 +04:00
}
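The conditions combined in tcp_should_autocork() can be explored outside the kernel with a simplified model. The sketch below is an illustration under stated assumptions, not the kernel implementation: `model_sock`, `model_skb` and `model_should_autocork()` are hypothetical names, and each kernel helper (tcp_rtx_queue_empty(), tcp_skb_can_collapse_to(), the sysctl read) is replaced by a plain field holding its result.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for the skb fields that matter here. */
struct model_skb {
	unsigned int len;	/* payload bytes currently in the skb */
	unsigned int truesize;	/* memory charged to the socket for it */
	bool can_collapse;	/* models tcp_skb_can_collapse_to(skb) */
};

/* Hypothetical user-space stand-in for the socket state. */
struct model_sock {
	bool autocorking_enabled;	/* models sysctl_tcp_autocorking */
	bool rtx_queue_empty;		/* models tcp_rtx_queue_empty(sk) */
	unsigned int wmem_alloc;	/* models sk->sk_wmem_alloc */
};

/*
 * Autocork only when the skb is not yet full, the feature is enabled,
 * earlier data is still unacknowledged, more than this skb alone is
 * charged to the socket (something is still in Qdisc/NIC queues), and
 * further payload may be appended to the skb.
 */
static bool model_should_autocork(const struct model_sock *sk,
				  const struct model_skb *skb,
				  unsigned int size_goal)
{
	return skb->len < size_goal &&
	       sk->autocorking_enabled &&
	       !sk->rtx_queue_empty &&
	       sk->wmem_alloc > skb->truesize &&
	       skb->can_collapse;
}
```

In this model a 128-byte skb with a 1448-byte size goal is held back only while `wmem_alloc` exceeds the skb's own `truesize`; once nothing else is charged to the socket, the function returns false and the skb would be pushed immediately.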
2020-01-09 18:59:21 +03:00
void tcp_push(struct sock *sk, int flags, int mss_now,
	      int nonagle, int size_goal)
2013-12-06 10:36:05 +04:00
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;
2009-12-10 10:16:52 +03:00
2013-12-06 10:36:05 +04:00
	skb = tcp_write_queue_tail(sk);
2017-10-06 08:21:27 +03:00
	if (!skb)
		return;
2013-12-06 10:36:05 +04:00
	if (!(flags & MSG_MORE) || forced_push(tp))
		tcp_mark_push(tp, skb);
	tcp_mark_urg(tp, flags);
	if (tcp_should_autocork(sk, skb, size_goal)) {
		/* avoid atomic op if TSQ_THROTTLED bit is already set */
2016-12-03 22:14:57 +03:00
		if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
2013-12-06 10:36:05 +04:00
			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
2016-12-03 22:14:57 +03:00
			set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
2013-12-06 10:36:05 +04:00
		}
2013-12-17 21:58:30 +04:00
		/* It is possible TX completion already happened
		 * before we set TSQ_THROTTLED.
		 */
2017-06-30 13:08:00 +03:00
		if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
2013-12-17 21:58:30 +04:00
			return;
2005-04-17 02:20:36 +04:00
	}
tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detect that an skb was underused
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in terms of CPU usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 10:36:05 +04:00
	if (flags & MSG_MORE)
		nonagle = TCP_NAGLE_CORK;
	__tcp_push_pending_frames(sk, mss_now, nonagle);
2005-04-17 02:20:36 +04:00
}
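The autocork decision described in the two commit messages above can be modeled in user space. This is a minimal sketch, not the kernel's code: the struct and its field names are hypothetical stand-ins for the socket and skb state that the commit messages mention (payload size versus size_goal, the sysctl, whether the skb is at the head of the write queue, and whether other packets of the flow are still awaiting TX completion).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the decision looks at; the field
 * names are illustrative and are not kernel identifiers.
 */
struct autocork_view {
	bool autocorking_enabled; /* net.ipv4.tcp_autocorking sysctl */
	bool skb_is_queue_head;   /* skb is the first packet in the write queue */
	size_t skb_len;           /* payload bytes currently in the skb */
	size_t size_goal;         /* optimal skb size, a multiple of MSS */
	size_t skb_truesize;      /* memory charged for this skb alone */
	size_t wmem_alloc;        /* socket bytes not yet freed by TX completion */
};

/* Return true when the push may be deferred until TX completion. */
static bool should_autocork(const struct autocork_view *v)
{
	return v->autocorking_enabled &&
	       v->skb_len < v->size_goal &&
	       !v->skb_is_queue_head &&
	       v->wmem_alloc > v->skb_truesize;
}
```

The `skb_is_queue_head` term reflects the later fix: a packet at the head of the write queue is always pushed, because nothing ahead of it is guaranteed to trigger a TX completion that would release it.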
2007-11-07 10:32:26 +03:00
static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
				unsigned int offset, size_t len)
2007-11-07 10:30:13 +03:00
{
	struct tcp_splice_state *tss = rd_desc->arg.data;
2009-01-14 03:04:36 +03:00
	int ret;
2007-11-07 10:30:13 +03:00
2015-05-21 18:00:00 +03:00
	ret = skb_splice_bits(skb, skb->sk, offset, tss->pipe,
2016-09-18 04:02:10 +03:00
			      min(rd_desc->count, len), tss->flags);
2009-01-14 03:04:36 +03:00
	if (ret > 0)
		rd_desc->count -= ret;
	return ret;
2007-11-07 10:30:13 +03:00
}
static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
{
	/* Store TCP splice context information in read_descriptor_t. */
	read_descriptor_t rd_desc = {
		.arg.data = tss,
2009-01-14 03:04:36 +03:00
		.count = tss->len,
2007-11-07 10:30:13 +03:00
	};
	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
}
/**
 * tcp_splice_read - splice data from TCP socket to a pipe
 * @sock: socket to splice from
 * @ppos: position (not valid)
 * @pipe: pipe to splice to
 * @len: number of bytes to splice
 * @flags: splice modifier flags
 *
 * Description:
 *    Will read pages from given socket and fill them into a pipe.
 *
 **/
ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
			struct pipe_inode_info *pipe, size_t len,
			unsigned int flags)
{
	struct sock *sk = sock->sk;
	struct tcp_splice_state tss = {
		.pipe = pipe,
		.len = len,
		.flags = flags,
	};
	long timeo;
	ssize_t spliced;
	int ret;
2010-07-13 01:00:12 +04:00
	sock_rps_record_flow(sk);
2007-11-07 10:30:13 +03:00
	/*
	 * We can't seek on a socket input
	 */
	if (unlikely(*ppos))
		return -ESPIPE;
	ret = spliced = 0;
	lock_sock(sk);
net: splice() from tcp to pipe should take into account O_NONBLOCK
tcp_splice_read() doesn't take into account the socket's O_NONBLOCK flag
Before this patch :
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE);
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.
A user application has no way to tell splice() that the socket should be in blocking mode
but the pipe in nonblocking mode.
Many projects cannot use splice(tcp -> pipe) because of this flaw.
http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html
Linus introduced SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
(splice: add SPLICE_F_NONBLOCK flag )
It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.
Linus's intention was clear: let SPLICE_F_NONBLOCK control the splice pipe mode only.
This patch instructs tcp_splice_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do.
Users will then call :
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK );
to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)
First version of this patch was submitted by Octavian Purdila
Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-02 02:26:00 +04:00
	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
2007-11-07 10:30:13 +03:00
	while (tss.len) {
		ret = __tcp_splice_read(sk, &tss);
		if (ret < 0)
			break;
		else if (!ret) {
			if (spliced)
				break;
			if (sock_flag(sk, SOCK_DONE))
				break;
			if (sk->sk_err) {
				ret = sock_error(sk);
				break;
			}
			if (sk->sk_shutdown & RCV_SHUTDOWN)
				break;
			if (sk->sk_state == TCP_CLOSE) {
				/*
				 * This occurs when user tries to read
				 * from never connected socket.
				 */
2018-07-08 09:15:56 +03:00
				ret = -ENOTCONN;
2007-11-07 10:30:13 +03:00
				break;
			}
			if (!timeo) {
				ret = -EAGAIN;
				break;
			}
2017-02-04 01:59:38 +03:00
			/* if __tcp_splice_read() got nothing while we have
			 * an skb in receive queue, we do not want to loop.
			 * This might happen with URG data.
			 */
			if (!skb_queue_empty(&sk->sk_receive_queue))
				break;
2023-10-11 10:20:55 +03:00
			ret = sk_wait_data(sk, &timeo, NULL);
			if (ret < 0)
				break;
2007-11-07 10:30:13 +03:00
			if (signal_pending(current)) {
				ret = sock_intr_errno(timeo);
				break;
			}
			continue;
		}
		tss.len -= ret;
		spliced += ret;
2023-06-23 15:38:55 +03:00
		if (!tss.len || !timeo)
2009-01-14 03:04:36 +03:00
			break;
2007-11-07 10:30:13 +03:00
		release_sock(sk);
		lock_sock(sk);
		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
2009-01-14 03:04:36 +03:00
		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
2007-11-07 10:30:13 +03:00
		    signal_pending(current))
			break;
	}
	release_sock(sk);
	if (spliced)
		return spliced;
	return ret;
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL(tcp_splice_read);
2007-11-07 10:30:13 +03:00
2023-06-09 23:42:46 +03:00
struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
2021-10-26 01:13:40 +03:00
				     bool force_schedule)
2007-11-29 12:28:50 +03:00
{
	struct sk_buff *skb;
2023-06-09 23:42:46 +03:00
	skb = alloc_skb_fclone(MAX_TCP_HEADER, gfp);
2015-05-15 22:39:28 +03:00
	if (likely(skb)) {
2015-05-19 23:26:55 +03:00
		bool mem_scheduled;
2015-05-15 22:39:28 +03:00
2021-11-03 05:58:44 +03:00
		skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
2015-05-19 23:26:55 +03:00
		if (force_schedule) {
			mem_scheduled = true;
2015-05-15 22:39:28 +03:00
			sk_forced_mem_schedule(sk, skb->truesize);
		} else {
2015-05-19 23:26:55 +03:00
			mem_scheduled = sk_wmem_schedule(sk, skb->truesize);
2015-05-15 22:39:28 +03:00
		}
2015-05-19 23:26:55 +03:00
		if (likely(mem_scheduled)) {
2021-10-26 01:13:41 +03:00
			skb_reserve(skb, MAX_TCP_HEADER);
2021-10-27 23:19:21 +03:00
			skb->ip_summed = CHECKSUM_PARTIAL;
2017-10-04 22:59:58 +03:00
			INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
2007-11-29 12:28:50 +03:00
			return skb;
		}
		__kfree_skb(skb);
	} else {
2008-07-17 07:28:10 +04:00
		sk->sk_prot->enter_memory_pressure(sk);
2007-11-29 12:28:50 +03:00
		sk_stream_moderate_sndbuf(sk);
	}
	return NULL;
}
2009-03-14 17:23:05 +03:00
static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
				       int large_allowed)
{
	struct tcp_sock *tp = tcp_sk(sk);
2015-03-05 19:03:06 +03:00
	u32 new_size_goal, size_goal;
tcp: refine TSO autosizing
Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)
At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).
This is bad because :
- It sends small TSO packets even in Slow Start where rate quickly
increases.
- It tends to make socket write queue very big, increasing tcp_ack()
processing time, but also increasing memory needs, not necessarily
accounted for, as fast clones overhead is currently ignored.
- Lower GRO efficiency and more ACK packets.
Servers with a lot of short-lived connections suffer from this.
Let's instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.
Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.
As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on same goal.
Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
contains number of mss that we can put in GSO packet, and is not
related to the autosizing goal anymore.
Tested:
40 ms rtt link
nstat >/dev/null
netperf -H remote -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
Before patch :
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/s
 87380 2000000 2000000    0.36     44.22
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0
After patch :
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
 87380 2000000 2000000    0.36     44.27
IpInReceives                    221                0.0
IpOutRequests                   232                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2013953            0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-07 23:22:18 +03:00
2018-02-19 22:56:48 +03:00
	if (!large_allowed)
2014-12-07 23:22:18 +03:00
		return mss_now;
2015-03-05 19:03:06 +03:00
	/* Note : tcp_tso_autosize() will eventually split this later */
2022-01-25 05:45:11 +03:00
	new_size_goal = tcp_bound_to_half_wnd(tp, sk->sk_gso_max_size);
2014-12-07 23:22:18 +03:00
	/* We try hard to avoid divides here */
	size_goal = tp->gso_segs * mss_now;
	if (unlikely(new_size_goal < size_goal ||
		     new_size_goal >= size_goal + mss_now)) {
		tp->gso_segs = min_t(u16, new_size_goal / mss_now,
				     sk->sk_gso_max_segs);
		size_goal = tp->gso_segs * mss_now;
2009-03-14 17:23:05 +03:00
	}
2014-12-07 23:22:18 +03:00
	return max(size_goal, mss_now);
2009-03-14 17:23:05 +03:00
}
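The divide-avoiding arithmetic in tcp_xmit_size_goal() can be tried out in isolation. The sketch below is a user-space model under stated assumptions: struct goal_cache and the flat parameter list are hypothetical, new_size_goal is passed in directly instead of being derived from the window, and the clamp is done in 32 bits before narrowing, which is not exactly how the kernel's min_t(u16, ...) behaves for very large quotients.

```c
#include <stdint.h>

/* Hypothetical stand-in for the cached tp->gso_segs value. */
struct goal_cache {
	uint16_t gso_segs;
};

/* Return the skb size goal in bytes: a multiple of mss_now, at least one
 * MSS. The division only runs when the cached segment count no longer
 * matches new_size_goal.
 */
static uint32_t xmit_size_goal(struct goal_cache *c, uint32_t mss_now,
			       uint32_t new_size_goal, uint16_t gso_max_segs,
			       int large_allowed)
{
	uint32_t size_goal;

	if (!large_allowed)
		return mss_now;

	size_goal = c->gso_segs * mss_now;
	if (new_size_goal < size_goal ||
	    new_size_goal >= size_goal + mss_now) {
		uint32_t segs = new_size_goal / mss_now;

		c->gso_segs = segs < gso_max_segs ? (uint16_t)segs : gso_max_segs;
		size_goal = c->gso_segs * mss_now;
	}
	return size_goal > mss_now ? size_goal : mss_now;
}
```

As long as new_size_goal stays within one MSS above the cached goal, repeated calls reuse the cached segment count and skip the division.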
2020-01-09 18:59:21 +03:00
int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
2009-03-14 17:23:05 +03:00
{
	int mss_now;
	mss_now = tcp_current_mss(sk);
	*size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
	return mss_now;
}
2023-06-24 01:55:12 +03:00
/* In some cases, sendmsg() could have added an skb to the write queue,
 * but failed adding payload on it. We need to remove it to consume less
 * memory, but more importantly be able to generate EPOLLOUT for Edge Trigger
 * epoll() users.
2019-08-26 19:19:15 +03:00
 */
2021-10-27 23:19:18 +03:00
void tcp_remove_empty_skb(struct sock *sk)
2019-08-26 19:19:15 +03:00
{
2021-10-27 23:19:18 +03:00
	struct sk_buff *skb = tcp_write_queue_tail(sk);
2021-10-25 02:59:03 +03:00
	if (skb && TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
2019-08-26 19:19:15 +03:00
		tcp_unlink_write_queue(skb, sk);
		if (tcp_write_queue_empty(sk))
			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
2021-10-30 05:05:41 +03:00
		tcp_wmem_free_skb(sk, skb);
2019-08-26 19:19:15 +03:00
	}
}
2022-02-04 01:55:47 +03:00
/* skb changing from pure zc to mixed, must charge zc */
static int tcp_downgrade_zcopy_pure(struct sock *sk, struct sk_buff *skb)
{
	if (unlikely(skb_zcopy_pure(skb))) {
		u32 extra = skb->truesize -
			    SKB_TRUESIZE(skb_end_offset(skb));
		if (!sk_wmem_schedule(sk, extra))
			return -ENOMEM;
		sk_mem_charge(sk, extra);
		skb_shinfo(skb)->flags &= ~SKBFL_PURE_ZEROCOPY;
	}
	return 0;
}
2022-06-14 20:17:34 +03:00
2023-06-09 23:42:44 +03:00
int tcp_wmem_schedule(struct sock *sk, int copy)
2022-06-14 20:17:34 +03:00
{
	int left;
	if (likely(sk_wmem_schedule(sk, copy)))
		return copy;
	/* We could be in trouble if we have nothing queued.
	 * Use whatever is left in sk->sk_forward_alloc and tcp_wmem[0]
	 * to guarantee some progress.
	 */
	left = sock_net(sk)->ipv4.sysctl_tcp_wmem[0] - sk->sk_wmem_queued;
	if (left > 0)
		sk_forced_mem_schedule(sk, min(left, copy));
	return min(copy, sk->sk_forward_alloc);
}
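The fallback path of tcp_wmem_schedule() is easier to follow with concrete numbers. The following is a simplified user-space model, not kernel code: struct mem_model and the model_* helpers are hypothetical, accounting is done in bytes without the page rounding the kernel applies, and a single pool_left counter stands in for the global memory limits.

```c
/* Hypothetical, page-rounding-free model of the socket memory counters. */
struct mem_model {
	int forward_alloc; /* bytes already reserved for this socket */
	int wmem_queued;   /* bytes sitting in the write queue */
	int wmem_min;      /* stands in for tcp_wmem[0] */
	int pool_left;     /* bytes the global pool can still grant */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Normal reservation: succeeds only if the pool can cover the shortfall. */
static int model_wmem_schedule(struct mem_model *m, int size)
{
	int shortfall = size - m->forward_alloc;

	if (shortfall <= 0)
		return 1;
	if (shortfall > m->pool_left)
		return 0;
	m->pool_left -= shortfall;
	m->forward_alloc += shortfall;
	return 1;
}

/* Forced reservation: always granted, may overdraw the pool. */
static void model_forced_schedule(struct mem_model *m, int size)
{
	int shortfall = size - m->forward_alloc;

	if (shortfall > 0) {
		m->pool_left -= shortfall;
		m->forward_alloc += shortfall;
	}
}

/* Mirrors the control flow of tcp_wmem_schedule() on the model. */
static int model_tcp_wmem_schedule(struct mem_model *m, int copy)
{
	int left;

	if (model_wmem_schedule(m, copy))
		return copy;
	left = m->wmem_min - m->wmem_queued;
	if (left > 0)
		model_forced_schedule(m, min_int(left, copy));
	return min_int(copy, m->forward_alloc);
}
```

With an exhausted pool, 1000 bytes queued and a 4096-byte minimum, a request for 8000 bytes is trimmed to 3096 in this model: the sender makes some progress, but only up to the guaranteed minimum.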
2012-07-19 10:43:09 +04:00
void tcp_free_fastopen_req(struct tcp_sock *tp)
{
2015-04-03 11:17:27 +03:00
	if (tp->fastopen_req) {
2012-07-19 10:43:09 +04:00
		kfree(tp->fastopen_req);
		tp->fastopen_req = NULL;
	}
}
2022-09-27 02:27:37 +03:00
int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
			 size_t size, struct ubuf_info *uarg)
2012-07-19 10:43:09 +04:00
{
	struct tcp_sock *tp = tcp_sk(sk);
net/tcp-fastopen: Add new API support
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences is not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created w/o changing other socket calls sequence:
s = socket()
create a new socket
setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
newly introduced sockopt
If set, new functionality described below will be used.
Return ENOTSUPP if TFO is not supported or not enabled in the
kernel.
connect()
With cookie present, return 0 immediately.
With no cookie, initiate 3WHS with TFO cookie-request option and
return -1 with errno = EINPROGRESS.
write()/sendmsg()
With cookie present, send out SYN with data and return the number of
bytes buffered.
With no cookie, and 3WHS not yet completed, return -1 with errno =
EINPROGRESS.
No MSG_FASTOPEN flag is needed.
read()
Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
write() is not called yet.
Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
established but no msg is received yet.
Return number of bytes read if socket is established and there is
msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 21:59:22 +03:00
	struct inet_sock *inet = inet_sk(sk);
2017-05-24 19:59:31 +03:00
	struct sockaddr *uaddr = msg->msg_name;
2012-07-19 10:43:09 +04:00
	int err, flags;
2022-07-15 20:17:54 +03:00
	if (!(READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_fastopen) &
	      TFO_CLIENT_ENABLE) ||
2017-05-24 19:59:31 +03:00
	    (uaddr && msg->msg_namelen >= sizeof(uaddr->sa_family) &&
	     uaddr->sa_family == AF_UNSPEC))
2012-07-19 10:43:09 +04:00
		return -EOPNOTSUPP;
2015-04-03 11:17:27 +03:00
	if (tp->fastopen_req)
2012-07-19 10:43:09 +04:00
		return -EALREADY; /* Another Fast Open is in progress */
	tp->fastopen_req = kzalloc(sizeof(struct tcp_fastopen_request),
				   sk->sk_allocation);
2015-04-03 11:17:26 +03:00
	if (unlikely(!tp->fastopen_req))
2012-07-19 10:43:09 +04:00
		return -ENOBUFS;
	tp->fastopen_req->data = msg;
2014-02-20 22:09:18 +04:00
	tp->fastopen_req->size = size;
2019-01-25 19:17:23 +03:00
	tp->fastopen_req->uarg = uarg;
2012-07-19 10:43:09 +04:00
2023-08-16 11:15:45 +03:00
	if (inet_test_bit(DEFER_CONNECT, sk)) {
2017-01-23 21:59:22 +03:00
		err = tcp_connect(sk);
		/* Same failure procedure as in tcp_v4/6_connect */
		if (err) {
			tcp_set_state(sk, TCP_CLOSE);
			inet->inet_dport = 0;
			sk->sk_route_caps = 0;
		}
	}
2012-07-19 10:43:09 +04:00
	flags = (msg->msg_flags & MSG_DONTWAIT) ? O_NONBLOCK : 0;
2017-05-24 19:59:31 +03:00
	err = __inet_stream_connect(sk->sk_socket, uaddr,
2017-01-25 16:42:46 +03:00
				    msg->msg_namelen, flags, 1);
2017-03-02 00:29:48 +03:00
	/* fastopen_req could already be freed in __inet_stream_connect
	 * if the connection times out or gets rst
	 */
	if (tp->fastopen_req) {
		*copied = tp->fastopen_req->copied;
		tcp_free_fastopen_req(tp);
2023-08-16 11:15:45 +03:00
		inet_clear_bit(DEFER_CONNECT, sk);
2017-03-02 00:29:48 +03:00
	}
2012-07-19 10:43:09 +04:00
	return err;
}
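On the client side, the TCP_FASTOPEN_CONNECT flow from the commit message above reduces to one extra setsockopt() call before connect(). The sketch below assumes a Linux system; the helper name tfo_client_socket is hypothetical, and the fallback definition of TCP_FASTOPEN_CONNECT (30, the Linux UAPI value) is only there for older libc headers that lack it.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30 /* value from the Linux UAPI header */
#endif

/* Hypothetical helper: create a TCP socket and try to opt in to Fast Open
 * on connect(). *tfo_enabled is set to 1 if the kernel accepted the option
 * and to 0 otherwise, so the caller can carry on with a regular handshake.
 * Returns the socket, or -1 if socket() itself failed.
 */
static int tfo_client_socket(int *tfo_enabled)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

	if (fd < 0)
		return -1;
	*tfo_enabled = setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
				  &one, sizeof(one)) == 0;
	return fd;
}
```

After this, the application calls connect() and write() exactly as it would without Fast Open; no MSG_FASTOPEN flag is needed, which is the point of the new option.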
2017-07-29 02:22:41 +03:00
int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
2005-04-17 02:20:36 +04:00
{
	struct tcp_sock *tp = tcp_sk(sk);
2017-08-03 23:29:44 +03:00
	struct ubuf_info *uarg = NULL;
2005-04-17 02:20:36 +04:00
	struct sk_buff *skb;
2016-04-03 06:08:12 +03:00
	struct sockcm_cookie sockc;
2014-11-28 21:40:20 +03:00
	int flags, err, copied = 0;
	int mss_now = 0, size_goal, copied_syn = 0;
2019-08-09 15:04:47 +03:00
	int process_backlog = 0;
2023-05-22 15:11:13 +03:00
	int zc = 0;
2005-04-17 02:20:36 +04:00
	long timeo;
	flags = msg->msg_flags;
2017-08-03 23:29:44 +03:00
2022-07-12 23:52:35 +03:00
	if ((flags & MSG_ZEROCOPY) && size) {
		if (msg->msg_ubuf) {
			uarg = msg->msg_ubuf;
2023-05-22 15:11:13 +03:00
			if (sk->sk_route_caps & NETIF_F_SG)
				zc = MSG_ZEROCOPY;
2022-07-12 23:52:35 +03:00
		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
2023-05-15 19:06:36 +03:00
			skb = tcp_write_queue_tail(sk);
2022-07-12 23:52:35 +03:00
			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
			if (!uarg) {
				err = -ENOBUFS;
				goto out_err;
			}
2023-05-22 15:11:13 +03:00
			if (sk->sk_route_caps & NETIF_F_SG)
				zc = MSG_ZEROCOPY;
			else
2022-09-23 19:39:04 +03:00
				uarg_to_msgzc(uarg)->zerocopy = 0;
2022-07-12 23:52:35 +03:00
		}
2023-05-22 15:11:13 +03:00
	} else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
		if (sk->sk_route_caps & NETIF_F_SG)
			zc = MSG_SPLICE_PAGES;
2017-08-03 23:29:44 +03:00
	}
2023-08-16 11:15:45 +03:00
	if (unlikely(flags & MSG_FASTOPEN ||
		     inet_test_bit(DEFER_CONNECT, sk)) &&
2018-04-25 21:33:08 +03:00
	    !tp->repair) {
2019-01-25 19:17:23 +03:00
		err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size, uarg);
2012-07-19 10:43:09 +04:00
		if (err == -EINPROGRESS && copied_syn > 0)
			goto out;
		else if (err)
			goto out_err;
	}
2005-04-17 02:20:36 +04:00
	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
2016-09-20 06:39:15 +03:00
	tcp_rate_check_app_limited(sk);  /* is sending application-limited? */
2012-08-31 16:29:12 +04:00
	/* Wait for a connection to finish. One exception is TCP Fast Open
	 * (passive side) where data is allowed to be sent before a connection
	 * is fully established.
	 */
	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
	    !tcp_passive_fastopen(sk)) {
2015-10-06 20:53:29 +03:00
		err = sk_stream_wait_connect(sk, &timeo);
		if (err != 0)
2012-07-19 10:43:09 +04:00
			goto do_error;
2012-08-31 16:29:12 +04:00
	}
2005-04-17 02:20:36 +04:00
2012-04-19 07:41:01 +04:00
	if (unlikely(tp->repair)) {
		if (tp->repair_queue == TCP_RECV_QUEUE) {
			copied = tcp_send_rcvq(sk, msg, size);
tcp: Fix divide by zero when pushing during tcp-repair
When in repair-mode and TCP_RECV_QUEUE is set, we end up calling
tcp_push with mss_now being 0. If data is in the send-queue and
tcp_set_skb_tso_segs gets called, we crash because it will divide by
mss_now:
[ 347.151939] divide error: 0000 [#1] SMP
[ 347.152907] Modules linked in:
[ 347.152907] CPU: 1 PID: 1123 Comm: packetdrill Not tainted 3.16.0-rc2 #4
[ 347.152907] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 347.152907] task: f5b88540 ti: f3c82000 task.ti: f3c82000
[ 347.152907] EIP: 0060:[<c1601359>] EFLAGS: 00210246 CPU: 1
[ 347.152907] EIP is at tcp_set_skb_tso_segs+0x49/0xa0
[ 347.152907] EAX: 00000b67 EBX: f5acd080 ECX: 00000000 EDX: 00000000
[ 347.152907] ESI: f5a28f40 EDI: f3c88f00 EBP: f3c83d10 ESP: f3c83d00
[ 347.152907] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 347.152907] CR0: 80050033 CR2: 083158b0 CR3: 35146000 CR4: 000006b0
[ 347.152907] Stack:
[ 347.152907] c167f9d9 f5acd080 000005b4 00000002 f3c83d20 c16013e6 f3c88f00 f5acd080
[ 347.152907] f3c83da0 c1603b5a f3c83d38 c10a0188 00000000 00000000 f3c83d84 c10acc85
[ 347.152907] c1ad5ec0 00000000 00000000 c1ad679c 010003e0 00000000 00000000 f3c88fc8
[ 347.152907] Call Trace:
[ 347.152907] [<c167f9d9>] ? apic_timer_interrupt+0x2d/0x34
[ 347.152907] [<c16013e6>] tcp_init_tso_segs+0x36/0x50
[ 347.152907] [<c1603b5a>] tcp_write_xmit+0x7a/0xbf0
[ 347.152907] [<c10a0188>] ? up+0x28/0x40
[ 347.152907] [<c10acc85>] ? console_unlock+0x295/0x480
[ 347.152907] [<c10ad24f>] ? vprintk_emit+0x1ef/0x4b0
[ 347.152907] [<c1605716>] __tcp_push_pending_frames+0x36/0xd0
[ 347.152907] [<c15f4860>] tcp_push+0xf0/0x120
[ 347.152907] [<c15f7641>] tcp_sendmsg+0xf1/0xbf0
[ 347.152907] [<c116d920>] ? kmem_cache_free+0xf0/0x120
[ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40
[ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40
[ 347.152907] [<c114f0f0>] ? do_wp_page+0x3e0/0x850
[ 347.152907] [<c161c36a>] inet_sendmsg+0x4a/0xb0
[ 347.152907] [<c1150269>] ? handle_mm_fault+0x709/0xfb0
[ 347.152907] [<c15a006b>] sock_aio_write+0xbb/0xd0
[ 347.152907] [<c1180b79>] do_sync_write+0x69/0xa0
[ 347.152907] [<c1181023>] vfs_write+0x123/0x160
[ 347.152907] [<c1181d55>] SyS_write+0x55/0xb0
[ 347.152907] [<c167f0d8>] sysenter_do_call+0x12/0x28
This can easily be reproduced with the following packetdrill-script (the
"magic" with netem, sk_pacing and limit_output_bytes is done to prevent
the kernel from pushing all segments, because hitting the limit without
doing this is not so easy with packetdrill):
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1460>
+0 > S. 0:0(0) ack 1 <mss 1460>
+0.1 < . 1:1(0) ack 1 win 65000
+0 accept(3, ..., ...) = 4
// This forces that not all segments of the snd-queue will be pushed
+0 `tc qdisc add dev tun0 root netem delay 10ms`
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=2`
+0 setsockopt(4, SOL_SOCKET, 47, [2], 4) = 0
+0 write(4,...,10000) = 10000
+0 write(4,...,10000) = 10000
// Set tcp-repair stuff, particularly TCP_RECV_QUEUE
+0 setsockopt(4, SOL_TCP, 19, [1], 4) = 0
+0 setsockopt(4, SOL_TCP, 20, [1], 4) = 0
// This now will make the write push the remaining segments
+0 setsockopt(4, SOL_SOCKET, 47, [20000], 4) = 0
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=130000`
// Now we will crash
+0 write(4,...,1000) = 1000
This happens since ec3423257508 (tcp: fix retransmission in repair
mode). Prior to that, the call to tcp_push was prevented by a check for
tp->repair.
The patch fixes it by adding a new goto label, out_nopush. When
tcp_sendmsg exits and no push is required, which is the case for
tp->repair, we jump to this label.
When repairing and calling send() with TCP_RECV_QUEUE, the data is
actually put in the receive-queue. So, no push is required because no
data has been added to the send-queue.
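The failure mode and the fix can be modelled in isolation. The sketch below is a userspace illustration, not the kernel code: `model_sock`, `pcount_for` and `model_sendmsg` are hypothetical names, and the MSS of 1460 is an assumed value. It assumes the segment count is a rounding-up division of the queued length by mss_now, and shows how jumping to out_nopush on the repair path keeps a zero mss_now away from that division.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the socket state that matters here. */
struct model_sock {
    bool   repair_recv_queue; /* repair mode with TCP_RECV_QUEUE selected */
    size_t sendq_bytes;       /* bytes already waiting in the send queue */
    int    pushes;            /* number of pushes actually performed */
};

/* Segment count for len bytes. The caller must guarantee mss_now != 0. */
unsigned int pcount_for(size_t len, unsigned int mss_now)
{
    return (unsigned int)((len + mss_now - 1) / mss_now);
}

/* Returns the number of segments pushed, or 0 when the push is skipped. */
unsigned int model_sendmsg(struct model_sock *sk, size_t len)
{
    unsigned int mss_now = 0;   /* stays 0 on the repair path */
    unsigned int pushed = 0;

    if (sk->repair_recv_queue)
        goto out_nopush;        /* data lands in the receive queue */

    mss_now = 1460;             /* assumed MSS for this model */
    sk->sendq_bytes += len;
    pushed = pcount_for(sk->sendq_bytes, mss_now);
    sk->sendq_bytes = 0;
    sk->pushes++;

out_nopush:
    return pushed;
}
```

With `repair_recv_queue` set, the model leaves the pending send-queue bytes untouched and never reaches the division, which is the behaviour the out_nopush label is meant to give.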
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Fixes: ec3423257508 (tcp: fix retransmission in repair mode)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-28 20:26:37 +04:00
goto out_nopush ;
2012-04-19 07:41:01 +04:00
}
err = - EINVAL ;
if ( tp - > repair_queue = = TCP_NO_QUEUE )
goto out_err ;
/* 'common' sending to sendq */
}
2018-07-06 17:12:56 +03:00
sockcm_init ( & sockc , sk ) ;
2016-04-03 06:08:12 +03:00
if ( msg - > msg_controllen ) {
err = sock_cmsg_send ( sk , msg , & sockc ) ;
if ( unlikely ( err ) ) {
err = - EINVAL ;
goto out_err ;
}
}
2005-04-17 02:20:36 +04:00
/* This should be in poll */
2015-11-30 07:03:10 +03:00
sk_clear_bit ( SOCKWQ_ASYNC_NOSPACE , sk ) ;
2005-04-17 02:20:36 +04:00
/* Ok commence sending. */
copied = 0 ;
2016-04-30 00:16:53 +03:00
restart :
mss_now = tcp_send_mss ( sk , & size_goal , flags ) ;
2005-04-17 02:20:36 +04:00
err = - EPIPE ;
if ( sk - > sk_err | | ( sk - > sk_shutdown & SEND_SHUTDOWN ) )
2016-11-03 00:41:50 +03:00
goto do_error ;
2005-04-17 02:20:36 +04:00
2014-12-16 05:39:31 +03:00
while ( msg_data_left ( msg ) ) {
2023-05-22 15:11:13 +03:00
ssize_t copy = 0 ;
2005-04-17 02:20:36 +04:00
2014-11-28 21:40:20 +03:00
skb = tcp_write_queue_tail ( sk ) ;
2018-02-19 22:56:50 +03:00
if ( skb )
copy = size_goal - skb - > len ;
2005-04-17 02:20:36 +04:00
tcp: Make use of MSG_EOR in tcp_sendmsg
This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg. The eor bit
will prevent data from appending to that skb in the future.
The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.
This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.
One potential use case is to combine MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get more accurate
TCP ack timestamps for application protocols with
multiple outgoing response messages (e.g. HTTP2).
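The effect of the eor bit on coalescing can be sketched with a small userspace model. This is an illustration under stated assumptions, not the kernel implementation: `model_skb`, `model_queue` and `model_send` are hypothetical names, and the model only tracks a length and an eor flag per skb. New data is appended to the tail skb unless that skb is full or already marked eor.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_SKBS 16

struct model_skb {
    size_t len;   /* payload bytes held by this skb */
    bool   eor;   /* set once a message ended here; blocks further appends */
};

struct model_queue {
    struct model_skb skb[MODEL_MAX_SKBS];
    size_t count;
};

/* Queue len bytes. Returns false if size_goal is 0 or the model queue is full. */
bool model_send(struct model_queue *q, size_t len, size_t size_goal,
                bool msg_eor)
{
    if (size_goal == 0)
        return false;

    while (len > 0) {
        struct model_skb *tail = q->count ? &q->skb[q->count - 1] : NULL;
        size_t room = 0;
        size_t copy;

        /* Appending is allowed only to a non-full tail without eor. */
        if (tail && !tail->eor && tail->len < size_goal)
            room = size_goal - tail->len;

        if (room == 0) {
            if (q->count == MODEL_MAX_SKBS)
                return false;
            tail = &q->skb[q->count++];
            tail->len = 0;
            tail->eor = false;
            room = size_goal;
        }

        copy = len < room ? len : room;
        tail->len += copy;
        len -= copy;

        /* Mark the skb that holds the last byte of the message. */
        if (len == 0 && msg_eor)
            tail->eor = true;
    }
    return true;
}
```

Two 730-byte sends with `msg_eor` set end up in two separate entries, while the same two sends without it share one 1460-byte entry, mirroring the two 730-byte segments the packetdrill script expects.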
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 00:44:48 +03:00
if ( copy < = 0 | | ! tcp_skb_can_collapse_to ( skb ) ) {
2016-09-15 19:33:02 +03:00
bool first_skb ;
2005-04-17 02:20:36 +04:00
new_segment :
2014-11-28 21:40:20 +03:00
if ( ! sk_stream_memory_free ( sk ) )
2020-09-15 00:52:10 +03:00
goto wait_for_space ;
2005-04-17 02:20:36 +04:00
2019-08-09 15:04:47 +03:00
if ( unlikely ( process_backlog > = 16 ) ) {
process_backlog = 0 ;
if ( sk_flush_backlog ( sk ) )
goto restart ;
2016-05-03 07:49:25 +03:00
}
2017-10-06 08:21:27 +03:00
first_skb = tcp_rtx_and_write_queues_empty ( sk ) ;
2023-06-09 23:42:46 +03:00
skb = tcp_stream_alloc_skb ( sk , sk - > sk_allocation ,
2021-10-26 01:13:40 +03:00
first_skb ) ;
2014-11-28 21:40:20 +03:00
if ( ! skb )
2020-09-15 00:52:10 +03:00
goto wait_for_space ;
2005-04-17 02:20:36 +04:00
2019-08-09 15:04:47 +03:00
process_backlog + + ;
2005-04-17 02:20:36 +04:00
2021-09-22 20:26:40 +03:00
tcp_skb_entail ( sk , skb ) ;
2014-11-28 21:40:20 +03:00
copy = size_goal ;
2014-08-13 16:03:10 +04:00
2014-11-28 21:40:20 +03:00
/* All packets are restored as if they have
2018-09-21 18:51:50 +03:00
* already been sent . skb_mstamp_ns isn ' t set to
2014-11-28 21:40:20 +03:00
* avoid wrong rtt estimation .
*/
if ( tp - > repair )
TCP_SKB_CB ( skb ) - > sacked | = TCPCB_REPAIRED ;
}
2005-04-17 02:20:36 +04:00
2014-11-28 21:40:20 +03:00
/* Try to append data to the end of skb. */
2014-12-16 05:39:31 +03:00
if ( copy > msg_data_left ( msg ) )
copy = msg_data_left ( msg ) ;
2014-11-28 21:40:20 +03:00
2023-05-22 15:11:13 +03:00
if ( zc = = 0 ) {
2014-11-28 21:40:20 +03:00
bool merge = true ;
int i = skb_shinfo ( skb ) - > nr_frags ;
struct page_frag * pfrag = sk_page_frag ( sk ) ;
if ( ! sk_page_frag_refill ( sk , pfrag ) )
2020-09-15 00:52:10 +03:00
goto wait_for_space ;
2005-09-02 04:48:59 +04:00
2014-11-28 21:40:20 +03:00
if ( ! skb_can_coalesce ( skb , i , pfrag - > page ,
pfrag - > offset ) ) {
2022-08-23 20:46:54 +03:00
if ( i > = READ_ONCE ( sysctl_max_skb_frags ) ) {
2014-11-28 21:40:20 +03:00
tcp_mark_push ( tp , skb ) ;
goto new_segment ;
2005-04-17 02:20:36 +04:00
}
2014-11-28 21:40:20 +03:00
merge = false ;
2005-04-17 02:20:36 +04:00
}
2014-11-28 21:40:20 +03:00
copy = min_t ( int , copy , pfrag - > size - pfrag - > offset ) ;
2022-07-12 23:52:35 +03:00
if ( unlikely ( skb_zcopy_pure ( skb ) | | skb_zcopy_managed ( skb ) ) ) {
if ( tcp_downgrade_zcopy_pure ( sk , skb ) )
goto wait_for_space ;
skb_zcopy_downgrade_managed ( skb ) ;
}
2022-06-14 20:17:34 +03:00
copy = tcp_wmem_schedule ( sk , copy ) ;
if ( ! copy )
2020-09-15 00:52:10 +03:00
goto wait_for_space ;
2005-04-17 02:20:36 +04:00
2014-11-28 21:40:20 +03:00
err = skb_copy_to_page_nocache ( sk , & msg - > msg_iter , skb ,
pfrag - > page ,
pfrag - > offset ,
copy ) ;
if ( err )
goto do_error ;
2005-04-17 02:20:36 +04:00
2014-11-28 21:40:20 +03:00
/* Update the skb. */
if ( merge ) {
skb_frag_size_add ( & skb_shinfo ( skb ) - > frags [ i - 1 ] , copy ) ;
} else {
skb_fill_page_desc ( skb , i , pfrag - > page ,
pfrag - > offset , copy ) ;
2017-02-17 20:11:42 +03:00
page_ref_inc ( pfrag - > page ) ;
net-timestamp: TCP timestamping
TCP timestamping extends SO_TIMESTAMPING to bytestreams.
Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.
The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.
This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.
This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.
If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.
The implementation supports a single timestamp per packet. It silently
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.
Implementation details:
- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.
- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).
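The relative ID described above can be illustrated with plain 32-bit arithmetic. The sketch assumes that the skb field holds the sequence number of the last byte of the send() buffer and that the sk field holds a sequence number recorded when the option was enabled; `model_tskey` and `model_report_id` are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Key for one send() call: sequence number of its last byte (len > 0). */
uint32_t model_tskey(uint32_t write_seq, uint32_t len)
{
    return write_seq + len - 1;
}

/* Value reported to userspace: offset from the recorded base.
 * Unsigned subtraction is modulo 2^32, so the offset stays correct
 * across sequence number wraparound and never exposes the absolute
 * sequence number.
 */
uint32_t model_report_id(uint32_t tskey, uint32_t base)
{
    return tskey - base;
}
```

For a base close to the wrap point, a 32-byte send starting at the base yields the offset 31, and a following 100-byte send yields 131, regardless of where the absolute sequence numbers lie.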
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 06:11:49 +04:00
}
2014-11-28 21:40:20 +03:00
pfrag - > offset + = copy ;
2023-05-22 15:11:13 +03:00
} else if ( zc = = MSG_ZEROCOPY ) {
2021-11-03 05:58:44 +03:00
/* First append to a fragless skb builds initial
* pure zerocopy skb
*/
if ( ! skb - > len )
skb_shinfo ( skb ) - > flags | = SKBFL_PURE_ZEROCOPY ;
if ( ! skb_zcopy_pure ( skb ) ) {
2022-06-14 20:17:34 +03:00
copy = tcp_wmem_schedule ( sk , copy ) ;
if ( ! copy )
2021-11-03 05:58:44 +03:00
goto wait_for_space ;
}
2021-07-09 18:43:06 +03:00
2017-08-03 23:29:44 +03:00
err = skb_zerocopy_iter_stream ( sk , skb , msg , copy , uarg ) ;
2017-12-23 03:00:18 +03:00
if ( err = = - EMSGSIZE | | err = = - EEXIST ) {
tcp_mark_push ( tp , skb ) ;
2017-08-03 23:29:44 +03:00
goto new_segment ;
2017-12-23 03:00:18 +03:00
}
2017-08-03 23:29:44 +03:00
if ( err < 0 )
goto do_error ;
copy = err ;
2023-05-22 15:11:13 +03:00
} else if ( zc = = MSG_SPLICE_PAGES ) {
/* Splice in data if we can; copy if we can't. */
if ( tcp_downgrade_zcopy_pure ( sk , skb ) )
goto wait_for_space ;
copy = tcp_wmem_schedule ( sk , copy ) ;
if ( ! copy )
goto wait_for_space ;
err = skb_splice_from_iter ( skb , & msg - > msg_iter , copy ,
sk - > sk_allocation ) ;
if ( err < 0 ) {
if ( err = = - EMSGSIZE ) {
tcp_mark_push ( tp , skb ) ;
goto new_segment ;
}
goto do_error ;
}
copy = err ;
if ( ! ( flags & MSG_NO_SHARED_FRAGS ) )
skb_shinfo ( skb ) - > flags | = SKBFL_SHARED_FRAG ;
sk_wmem_queued_add ( sk , copy ) ;
sk_mem_charge ( sk , copy ) ;
2014-11-28 21:40:20 +03:00
}
if ( ! copied )
TCP_SKB_CB ( skb ) - > tcp_flags & = ~ TCPHDR_PSH ;
2019-10-11 06:17:41 +03:00
WRITE_ONCE ( tp - > write_seq , tp - > write_seq + copy ) ;
2014-11-28 21:40:20 +03:00
TCP_SKB_CB ( skb ) - > end_seq + = copy ;
tcp_skb_pcount_set ( skb , 0 ) ;
2005-04-17 02:20:36 +04:00
2014-11-28 21:40:20 +03:00
copied + = copy ;
2014-12-16 05:39:31 +03:00
if ( ! msg_data_left ( msg ) ) {
2016-04-26 00:44:48 +03:00
if ( unlikely ( flags & MSG_EOR ) )
TCP_SKB_CB ( skb ) - > eor = 1 ;
2014-11-28 21:40:20 +03:00
goto out ;
}
2005-04-17 02:20:36 +04:00
2018-02-19 22:56:50 +03:00
if ( skb - > len < size_goal | | ( flags & MSG_OOB ) | | unlikely ( tp - > repair ) )
2005-04-17 02:20:36 +04:00
continue ;
2014-11-28 21:40:20 +03:00
if ( forced_push ( tp ) ) {
tcp_mark_push ( tp , skb ) ;
__tcp_push_pending_frames ( sk , mss_now , TCP_NAGLE_PUSH ) ;
} else if ( skb = = tcp_send_head ( sk ) )
tcp_push_one ( sk , mss_now ) ;
continue ;
2020-09-15 00:52:10 +03:00
wait_for_space :
2014-11-28 21:40:20 +03:00
set_bit ( SOCK_NOSPACE , & sk - > sk_socket - > flags ) ;
if ( copied )
tcp_push ( sk , flags & ~ MSG_MORE , mss_now ,
TCP_NAGLE_PUSH , size_goal ) ;
2005-04-17 02:20:36 +04:00
2015-10-06 20:53:29 +03:00
err = sk_stream_wait_memory ( sk , & timeo ) ;
if ( err ! = 0 )
2014-11-28 21:40:20 +03:00
goto do_error ;
2005-04-17 02:20:36 +04:00
2014-11-28 21:40:20 +03:00
mss_now = tcp_send_mss ( sk , & size_goal , flags ) ;
2005-04-17 02:20:36 +04:00
}
out :
2017-01-04 19:19:34 +03:00
if ( copied ) {
2017-10-06 08:21:23 +03:00
tcp_tx_timestamp ( sk , sockc . tsflags ) ;
tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
The idea is to change tcp_push() to check whether the current skb payload is
under the optimal skb size (a multiple of MSS bytes).
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
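The decision described above reduces to a three-part condition. The following is a simplified userspace sketch, not the kernel's actual check: `model_cork_state` and `model_should_autocork` are hypothetical names, and the "at least one packet still in Qdisc or NIC TX queues" test is modelled as a plain byte counter.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_cork_state {
    bool   autocorking;       /* mirrors the tcp_autocorking sysctl */
    size_t skb_len;           /* payload of the skb about to be pushed */
    size_t size_goal;         /* optimal skb size, a multiple of MSS */
    size_t queued_below_tcp;  /* bytes of this flow still in Qdisc/NIC TX */
};

/* True when the push should be deferred until TX completion. */
bool model_should_autocork(const struct model_cork_state *s)
{
    return s->autocorking &&
           s->skb_len < s->size_goal &&
           s->queued_below_tcp > 0;
}
```

When nothing from the flow is held below TCP, the function returns false and the push happens at once, which corresponds to the zero-delay case mentioned in the text.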
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled).
Add a new SNMP counter: nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detect that an skb was underused
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in terms of cpu usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 10:36:05 +04:00
tcp_push ( sk , flags , mss_now , tp - > nonagle , size_goal ) ;
2017-01-04 19:19:34 +03:00
}
2014-06-28 20:26:37 +04:00
out_nopush :
2023-05-15 19:06:37 +03:00
/* msg->msg_ubuf is pinned by the caller so we don't take extra refs */
if ( uarg & & ! msg - > msg_ubuf )
net_zcopy_put ( uarg ) ;
2012-07-19 10:43:09 +04:00
return copied + copied_syn ;
2005-04-17 02:20:36 +04:00
2019-08-26 19:19:15 +03:00
do_error :
2021-10-27 23:19:18 +03:00
tcp_remove_empty_skb ( sk ) ;
2005-04-17 02:20:36 +04:00
2012-07-19 10:43:09 +04:00
if ( copied + copied_syn )
2005-04-17 02:20:36 +04:00
goto out ;
out_err :
2023-05-15 19:06:37 +03:00
/* msg->msg_ubuf is pinned by the caller so we don't take extra refs */
if ( uarg & & ! msg - > msg_ubuf )
net_zcopy_put_abort ( uarg , true ) ;
2005-04-17 02:20:36 +04:00
err = sk_stream_error ( sk , flags , err ) ;
2015-05-20 18:52:53 +03:00
/* make sure we wake any epoll edge trigger waiter */
2019-12-12 23:55:31 +03:00
if ( unlikely ( tcp_rtx_and_write_queues_empty ( sk ) & & err = = - EAGAIN ) ) {
2015-05-20 18:52:53 +03:00
sk - > sk_write_space ( sk ) ;
2016-11-28 10:07:16 +03:00
tcp_chrono_stop ( sk , TCP_CHRONO_SNDBUF_LIMITED ) ;
}
2005-04-17 02:20:36 +04:00
return err ;
}
2017-08-17 01:40:44 +03:00
EXPORT_SYMBOL_GPL ( tcp_sendmsg_locked ) ;
2017-07-29 02:22:41 +03:00
int tcp_sendmsg ( struct sock * sk , struct msghdr * msg , size_t size )
{
int ret ;
lock_sock ( sk ) ;
ret = tcp_sendmsg_locked ( sk , msg , size ) ;
release_sock ( sk ) ;
return ret ;
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( tcp_sendmsg ) ;
2005-04-17 02:20:36 +04:00
2023-06-07 21:19:13 +03:00
void tcp_splice_eof ( struct socket * sock )
{
struct sock * sk = sock - > sk ;
struct tcp_sock * tp = tcp_sk ( sk ) ;
int mss_now , size_goal ;
if ( ! tcp_write_queue_tail ( sk ) )
return ;
lock_sock ( sk ) ;
mss_now = tcp_send_mss ( sk , & size_goal , 0 ) ;
tcp_push ( sk , 0 , mss_now , tp - > nonagle , size_goal ) ;
release_sock ( sk ) ;
}
EXPORT_SYMBOL_GPL ( tcp_splice_eof ) ;
2005-04-17 02:20:36 +04:00
/*
* Handle reading urgent data . BSD has very simple semantics for
* this , no blocking and very strange errors 8 )
*/
2009-04-01 01:43:17 +04:00
static int tcp_recv_urg ( struct sock * sk , struct msghdr * msg , int len , int flags )
2005-04-17 02:20:36 +04:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
/* No URG data to read. */
if ( sock_flag ( sk , SOCK_URGINLINE ) | | ! tp - > urg_data | |
tp - > urg_data = = TCP_URG_READ )
return - EINVAL ; /* Yes this is right ! */
if ( sk - > sk_state = = TCP_CLOSE & & ! sock_flag ( sk , SOCK_DONE ) )
return - ENOTCONN ;
if ( tp - > urg_data & TCP_URG_VALID ) {
int err = 0 ;
char c = tp - > urg_data ;
if ( ! ( flags & MSG_PEEK ) )
2021-11-15 22:02:43 +03:00
WRITE_ONCE ( tp - > urg_data , TCP_URG_READ ) ;
2005-04-17 02:20:36 +04:00
/* Read urgent data. */
msg - > msg_flags | = MSG_OOB ;
if ( len > 0 ) {
if ( ! ( flags & MSG_TRUNC ) )
2014-04-07 05:51:23 +04:00
err = memcpy_to_msg ( msg , & c , 1 ) ;
2005-04-17 02:20:36 +04:00
len = 1 ;
} else
msg - > msg_flags | = MSG_TRUNC ;
return err ? - EFAULT : len ;
}
if ( sk - > sk_state = = TCP_CLOSE | | ( sk - > sk_shutdown & RCV_SHUTDOWN ) )
return 0 ;
/* Fixed the recv(..., MSG_OOB) behaviour. BSD docs and
* the available implementations agree in this case :
* this call should never block , independent of the
* blocking state of the socket .
* Mike < pall @ rz . uni - karlsruhe . de >
*/
return - EAGAIN ;
}
2012-04-19 07:41:01 +04:00
static int tcp_peek_sndq ( struct sock * sk , struct msghdr * msg , int len )
{
struct sk_buff * skb ;
int copied = 0 , err = 0 ;
/* XXX -- need to support SO_PEEK_OFF */
2017-10-06 08:21:27 +03:00
skb_rbtree_walk ( skb , & sk - > tcp_rtx_queue ) {
err = skb_copy_datagram_msg ( skb , 0 , msg , skb - > len ) ;
if ( err )
return err ;
copied + = skb - > len ;
}
2012-04-19 07:41:01 +04:00
skb_queue_walk ( & sk - > sk_write_queue , skb ) {
2014-11-06 00:46:40 +03:00
err = skb_copy_datagram_msg ( skb , 0 , msg , skb - > len ) ;
2012-04-19 07:41:01 +04:00
if ( err )
break ;
copied + = skb - > len ;
}
return err ? : copied ;
}
2005-04-17 02:20:36 +04:00
/* Clean up the receive buffer for full frames taken by the user,
* then send an ACK if necessary . COPIED is the number of bytes
* tcp_recvmsg has given to the user so far , it speeds up the
* calculation of whether or not we must ACK for the sake of
* a window update .
*/
2023-05-23 05:56:12 +03:00
void __tcp_cleanup_rbuf ( struct sock * sk , int copied )
2005-04-17 02:20:36 +04:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
2012-05-17 03:15:34 +04:00
bool time_to_ack = false ;
2005-04-17 02:20:36 +04:00
2005-08-10 07:10:42 +04:00
if ( inet_csk_ack_scheduled ( sk ) ) {
const struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2020-09-30 15:54:56 +03:00
if ( /* Once-per-two-segments ACK was not sent by tcp_input.c */
2005-08-10 07:10:42 +04:00
tp - > rcv_nxt - tp - > rcv_wup > icsk - > icsk_ack . rcv_mss | |
2005-04-17 02:20:36 +04:00
/*
* If this read emptied read buffer , we send ACK , if
* connection is not bidirectional , user drained
* receive buffer and there was a small segment
* in queue .
*/
2006-09-19 23:52:50 +04:00
( copied > 0 & &
( ( icsk - > icsk_ack . pending & ICSK_ACK_PUSHED2 ) | |
( ( icsk - > icsk_ack . pending & ICSK_ACK_PUSHED ) & &
2019-01-25 21:53:19 +03:00
! inet_csk_in_pingpong_mode ( sk ) ) ) & &
2006-09-19 23:52:50 +04:00
! atomic_read ( & sk - > sk_rmem_alloc ) ) )
2012-05-17 03:15:34 +04:00
time_to_ack = true ;
2005-04-17 02:20:36 +04:00
}
/* We send an ACK if we can now advertise a non-zero window
* which has been raised " significantly " .
*
* Even if window raised up to infinity , do not send window open ACK
* in states , where we will not receive more . It is useless .
*/
if ( copied > 0 & & ! time_to_ack & & ! ( sk - > sk_shutdown & RCV_SHUTDOWN ) ) {
__u32 rcv_window_now = tcp_receive_window ( tp ) ;
/* Optimize, __tcp_select_window() is not cheap. */
if ( 2 * rcv_window_now < = tp - > window_clamp ) {
__u32 new_window = __tcp_select_window ( sk ) ;
/* Send ACK now, if this read freed lots of space
* in our buffer . Certainly , new_window is new window .
* We can advertise it now , if it is not less than current one .
* " Lots " means " at least twice " here .
*/
if ( new_window & & new_window > = 2 * rcv_window_now )
2012-05-17 03:15:34 +04:00
time_to_ack = true ;
2005-04-17 02:20:36 +04:00
}
}
if ( time_to_ack )
tcp_send_ack ( sk ) ;
}
2022-08-17 22:54:43 +03:00
void tcp_cleanup_rbuf ( struct sock * sk , int copied )
{
struct sk_buff * skb = skb_peek ( & sk - > sk_receive_queue ) ;
struct tcp_sock * tp = tcp_sk ( sk ) ;
WARN ( skb & & ! before ( tp - > copied_seq , TCP_SKB_CB ( skb ) - > end_seq ) ,
" cleanup rbuf bug: copied %X seq %X rcvnxt %X \n " ,
tp - > copied_seq , TCP_SKB_CB ( skb ) - > end_seq , tp - > rcv_nxt ) ;
__tcp_cleanup_rbuf ( sk , copied ) ;
}
2021-11-15 22:02:45 +03:00
static void tcp_eat_recv_skb ( struct sock * sk , struct sk_buff * skb )
{
tcp: defer skb freeing after socket lock is released
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.
A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.
Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.
This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.
This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.
Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free asymmetry,
when BH handler and user thread do not run on same cpu or
NUMA node.
One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
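The deferral pattern can be sketched in userspace as follows. This is a single-threaded model under stated assumptions, not the kernel code: `model_buf`, `model_rx_sock`, `model_eat` and `model_release` are hypothetical names, the lock is a plain flag, and the free from the BH handler is left out. Consumed buffers are parked on a list while the lock is held and are freed only after it has been dropped.

```c
#include <stddef.h>
#include <stdlib.h>

struct model_buf {
    struct model_buf *next;
};

struct model_rx_sock {
    int locked;                    /* 1 while the "socket lock" is held */
    struct model_buf *defer_list;  /* consumed buffers awaiting free */
};

/* Called with the lock held: park the buffer instead of freeing it. */
void model_eat(struct model_rx_sock *sk, struct model_buf *b)
{
    b->next = sk->defer_list;
    sk->defer_list = b;
}

/* Drop the lock first, then free everything that was deferred.
 * Returns the number of buffers freed outside the locked section.
 */
size_t model_release(struct model_rx_sock *sk)
{
    struct model_buf *list = sk->defer_list;
    size_t freed = 0;

    sk->defer_list = NULL;
    sk->locked = 0;

    while (list) {
        struct model_buf *next = list->next;

        free(list);
        list = next;
        freed++;
    }
    return freed;
}
```

The point of the ordering in `model_release` is that the expensive frees no longer extend the time during which the lock is held.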
Tested:
100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs
MTU : 1500
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+(headers)
Before: 82 Gbit
After: 95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-15 22:02:46 +03:00
__skb_unlink ( skb , & sk - > sk_receive_queue ) ;
2021-11-15 22:02:45 +03:00
if ( likely ( skb - > destructor = = sock_rfree ) ) {
sock_rfree ( skb ) ;
skb - > destructor = NULL ;
skb - > sk = NULL ;
net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.
But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.
For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.
For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.
Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.
This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.
In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.
Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)
Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.
10 runs of one TCP_STREAM flow
Before:
Average throughput: 49685 Mbit.
Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)
57.81% [kernel] [k] copy_user_enhanced_fast_string
(*) 12.87% [kernel] [k] skb_release_data
(*) 4.25% [kernel] [k] __free_one_page
(*) 3.57% [kernel] [k] __list_del_entry_valid
1.85% [kernel] [k] __netif_receive_skb_core
1.60% [kernel] [k] __skb_datagram_iter
(*) 1.59% [kernel] [k] free_unref_page_commit
(*) 1.16% [kernel] [k] __slab_free
1.16% [kernel] [k] _copy_to_iter
(*) 1.01% [kernel] [k] kfree
(*) 0.88% [kernel] [k] free_unref_page
0.57% [kernel] [k] ip6_rcv_core
0.55% [kernel] [k] ip6t_do_table
0.54% [kernel] [k] flush_smp_call_function_queue
(*) 0.54% [kernel] [k] free_pcppages_bulk
0.51% [kernel] [k] llist_reverse_order
0.38% [kernel] [k] process_backlog
(*) 0.38% [kernel] [k] free_pcp_prepare
0.37% [kernel] [k] tcp_recvmsg_locked
(*) 0.37% [kernel] [k] __list_add_valid
0.34% [kernel] [k] sock_rfree
0.34% [kernel] [k] _raw_spin_lock_irq
(*) 0.33% [kernel] [k] __page_cache_release
0.33% [kernel] [k] tcp_v6_rcv
(*) 0.33% [kernel] [k] __put_page
(*) 0.29% [kernel] [k] __mod_zone_page_state
0.27% [kernel] [k] _raw_spin_lock
After patch:
Average throughput: 73076 Mbit.
Kernel profiles on cpu running user thread recvmsg() look better:
81.35% [kernel] [k] copy_user_enhanced_fast_string
1.95% [kernel] [k] _copy_to_iter
1.95% [kernel] [k] __skb_datagram_iter
1.27% [kernel] [k] __netif_receive_skb_core
1.03% [kernel] [k] ip6t_do_table
0.60% [kernel] [k] sock_rfree
0.50% [kernel] [k] tcp_v6_rcv
0.47% [kernel] [k] ip6_rcv_core
0.45% [kernel] [k] read_tsc
0.44% [kernel] [k] _raw_spin_lock_irqsave
0.37% [kernel] [k] _raw_spin_lock
0.37% [kernel] [k] native_irq_return_iret
0.33% [kernel] [k] __inet6_lookup_established
0.31% [kernel] [k] ip6_protocol_deliver_rcu
0.29% [kernel] [k] tcp_rcv_established
0.29% [kernel] [k] llist_reverse_order
v2: kdoc issue (kernel bots)
do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
replace the sk_buff_head with a single-linked list (Jakub)
add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-22 23:12:37 +03:00
return skb_attempt_defer_free ( skb ) ;
2021-11-15 22:02:45 +03:00
}
2021-11-15 22:02:46 +03:00
__kfree_skb ( skb ) ;
2021-11-15 22:02:45 +03:00
}
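The deferral described in the commit messages above can be modelled in user space. The following is a minimal sketch, not the kernel implementation: `struct buf_node`, `struct defer_list` and `attempt_defer_free()` are hypothetical names, the lists are plain arrays rather than real per-cpu data, and all locking and IPI handling is left out. It only illustrates the two decisions the text mentions: do not defer when the buffer was allocated on the current cpu, and otherwise push the buffer onto a singly linked list owned by the allocating cpu.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb that remembers where it was allocated. */
struct buf_node {
	int alloc_cpu;          /* cpu that allocated this buffer */
	struct buf_node *next;  /* singly linked, as in the v2 notes */
};

/* Hypothetical stand-in for one cpu's deferred-free list. */
struct defer_list {
	struct buf_node *head;
	int count;
};

/*
 * Returns 1 if the buffer was queued on the allocating cpu's list,
 * 0 if the caller should free it immediately (same cpu, or an
 * out-of-range cpu id). No locking: this is a single-threaded model.
 */
static int attempt_defer_free(struct defer_list *lists, int ncpus,
			      struct buf_node *n, int this_cpu)
{
	int cpu;

	assert(lists != NULL && n != NULL);
	cpu = n->alloc_cpu;
	if (cpu == this_cpu || cpu < 0 || cpu >= ncpus)
		return 0;

	n->next = lists[cpu].head;
	lists[cpu].head = n;
	lists[cpu].count++;
	return 1;
}
```

In this model the owner of `lists[cpu]` would later walk the list and release each node, which corresponds to the drain step the commit message places at the end of softirq processing.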
2022-07-23 02:50:31 +03:00
struct sk_buff * tcp_recv_skb ( struct sock * sk , u32 seq , u32 * off )
2005-04-17 02:20:36 +04:00
{
struct sk_buff * skb ;
u32 offset ;
2013-01-10 00:59:09 +04:00
while ( ( skb = skb_peek ( & sk - > sk_receive_queue ) ) ! = NULL ) {
2005-04-17 02:20:36 +04:00
offset = seq - TCP_SKB_CB ( skb ) - > seq ;
2016-02-02 08:03:08 +03:00
if ( unlikely ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_SYN ) ) {
pr_err_once ( " %s: found a SYN, please report ! \n " , __func__ ) ;
2005-04-17 02:20:36 +04:00
offset - - ;
2016-02-02 08:03:08 +03:00
}
2014-09-15 15:19:51 +04:00
if ( offset < skb - > len | | ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_FIN ) ) {
2005-04-17 02:20:36 +04:00
* off = offset ;
return skb ;
}
2013-01-10 00:59:09 +04:00
/* This looks weird, but this can happen if TCP collapsing
* split a fat GRO packet , while we released socket lock
* in skb_splice_bits ( )
*/
2021-11-15 22:02:45 +03:00
tcp_eat_recv_skb ( sk , skb ) ;
2005-04-17 02:20:36 +04:00
}
return NULL ;
}
2022-07-23 02:50:31 +03:00
EXPORT_SYMBOL ( tcp_recv_skb ) ;
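The lookup in tcp_recv_skb() depends on unsigned 32-bit arithmetic: `seq - TCP_SKB_CB(skb)->seq` stays meaningful even when sequence numbers wrap past 2^32. The sketch below models only that arithmetic in user space. `struct seg` and `find_segment()` are hypothetical names, the queue is a plain array, and unlike the kernel function the sketch skips fully consumed segments instead of unlinking and freeing them; the SYN adjustment is also omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one skb on the receive queue. */
struct seg {
	uint32_t seq;  /* sequence number of the first payload byte */
	uint32_t len;  /* payload length in bytes */
	int fin;       /* nonzero if the segment carries FIN */
};

/*
 * Return the index of the first segment that still contains byte 'seq'
 * (or that carries FIN) and store the offset into it in *off.
 * Returns -1 when no segment matches.
 */
static int find_segment(const struct seg *q, size_t n, uint32_t seq,
			uint32_t *off)
{
	size_t i;

	assert(q != NULL && off != NULL);
	for (i = 0; i < n; i++) {
		/* Modulo-2^32 difference, valid across wraparound. */
		uint32_t offset = seq - q[i].seq;

		if (offset < q[i].len || q[i].fin) {
			*off = offset;
			return (int)i;
		}
	}
	return -1;
}
```

For example, a segment starting at 0xfffffff0 with length 0x20 covers sequence number 0x5 at offset 0x15, even though 0x5 is numerically smaller than the segment's starting sequence number.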
2005-04-17 02:20:36 +04:00
/*
* This routine provides an alternative to tcp_recvmsg ( ) for routines
* that would like to handle copying from skbuffs directly in ' sendfile '
* fashion .
* Note :
* - It is assumed that the socket was locked by the caller .
* - The routine does not block .
* - At present , there is no support for reading OOB data
* or for ' peeking ' the socket using this routine
* ( although both would be easy to implement ) .
*/
int tcp_read_sock ( struct sock * sk , read_descriptor_t * desc ,
sk_read_actor_t recv_actor )
{
struct sk_buff * skb ;
struct tcp_sock * tp = tcp_sk ( sk ) ;
u32 seq = tp - > copied_seq ;
u32 offset ;
int copied = 0 ;
if ( sk - > sk_state = = TCP_LISTEN )
return - ENOTCONN ;
while ( ( skb = tcp_recv_skb ( sk , seq , & offset ) ) ! = NULL ) {
if ( offset < skb - > len ) {
2008-07-03 14:31:21 +04:00
int used ;
size_t len ;
2005-04-17 02:20:36 +04:00
len = skb - > len - offset ;
/* Stop reading if we hit a patch of urgent data */
2021-11-15 22:02:44 +03:00
if ( unlikely ( tp - > urg_data ) ) {
2005-04-17 02:20:36 +04:00
u32 urg_offset = tp - > urg_seq - seq ;
if ( urg_offset < len )
len = urg_offset ;
if ( ! len )
break ;
}
used = recv_actor ( desc , skb , offset , len ) ;
2013-01-10 11:06:10 +04:00
if ( used < = 0 ) {
2007-06-24 10:07:50 +04:00
if ( ! copied )
copied = used ;
break ;
2005-04-17 02:20:36 +04:00
}
2022-03-02 19:17:23 +03:00
if ( WARN_ON_ONCE ( used > len ) )
used = len ;
seq + = used ;
copied + = used ;
offset + = used ;
2012-12-02 15:49:27 +04:00
/* If recv_actor drops the lock (e.g. TCP splice
2008-06-05 02:45:58 +04:00
* receive ) the skb pointer might be invalid when
* getting here : tcp_collapse might have deleted it
* while aggregating skbs from the socket queue .
*/
2012-12-02 15:49:27 +04:00
skb = tcp_recv_skb ( sk , seq - 1 , & offset ) ;
if ( ! skb )
2005-04-17 02:20:36 +04:00
break ;
2012-12-02 15:49:27 +04:00
/* TCP coalescing might have appended data to the skb.
* Try to splice more frags
*/
if ( offset + 1 ! = skb - > len )
continue ;
2005-04-17 02:20:36 +04:00
}
2014-09-15 15:19:51 +04:00
if ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_FIN ) {
2021-11-15 22:02:45 +03:00
tcp_eat_recv_skb ( sk , skb ) ;
2005-04-17 02:20:36 +04:00
+ + seq ;
break ;
}
2021-11-15 22:02:45 +03:00
tcp_eat_recv_skb ( sk , skb ) ;
2005-04-17 02:20:36 +04:00
if ( ! desc - > count )
break ;
2019-10-11 06:17:40 +03:00
WRITE_ONCE ( tp - > copied_seq , seq ) ;
2005-04-17 02:20:36 +04:00
}
2019-10-11 06:17:40 +03:00
WRITE_ONCE ( tp - > copied_seq , seq ) ;
2005-04-17 02:20:36 +04:00
tcp_rcv_space_adjust ( sk ) ;
/* Clean up data we have read: This will do ACK frames. */
2013-01-10 00:59:09 +04:00
if ( copied > 0 ) {
tcp_recv_skb ( sk , seq , & offset ) ;
2006-05-24 05:00:16 +04:00
tcp_cleanup_rbuf ( sk , copied ) ;
2013-01-10 00:59:09 +04:00
}
2005-04-17 02:20:36 +04:00
return copied ;
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( tcp_read_sock ) ;
2005-04-17 02:20:36 +04:00
2022-06-15 19:20:12 +03:00
int tcp_read_skb ( struct sock * sk , skb_read_actor_t recv_actor )
2022-06-15 19:20:11 +03:00
{
struct sk_buff * skb ;
int copied = 0 ;
if ( sk - > sk_state = = TCP_LISTEN )
return - ENOTCONN ;
2023-09-26 06:52:58 +03:00
while ( ( skb = skb_peek ( & sk - > sk_receive_queue ) ) ! = NULL ) {
2022-09-12 20:35:53 +03:00
u8 tcp_flags ;
int used ;
2022-06-15 19:20:11 +03:00
2022-09-12 20:35:53 +03:00
__skb_unlink ( skb , & sk - > sk_receive_queue ) ;
WARN_ON_ONCE ( ! skb_set_owner_sk_safe ( skb , sk ) ) ;
tcp_flags = TCP_SKB_CB ( skb ) - > tcp_flags ;
used = recv_actor ( sk , skb ) ;
if ( used < 0 ) {
if ( ! copied )
copied = used ;
break ;
}
copied + = used ;
2022-06-15 19:20:11 +03:00
2023-09-26 06:52:58 +03:00
if ( tcp_flags & TCPHDR_FIN )
2022-09-12 20:35:53 +03:00
break ;
2022-06-15 19:20:11 +03:00
}
return copied ;
}
EXPORT_SYMBOL ( tcp_read_skb ) ;
2022-07-23 02:50:31 +03:00
void tcp_read_done ( struct sock * sk , size_t len )
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
u32 seq = tp - > copied_seq ;
struct sk_buff * skb ;
size_t left ;
u32 offset ;
if ( sk - > sk_state = = TCP_LISTEN )
return ;
left = len ;
while ( left & & ( skb = tcp_recv_skb ( sk , seq , & offset ) ) ! = NULL ) {
int used ;
used = min_t ( size_t , skb - > len - offset , left ) ;
seq + = used ;
left - = used ;
if ( skb - > len > offset + used )
break ;
if ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_FIN ) {
tcp_eat_recv_skb ( sk , skb ) ;
+ + seq ;
break ;
}
tcp_eat_recv_skb ( sk , skb ) ;
}
WRITE_ONCE ( tp - > copied_seq , seq ) ;
tcp_rcv_space_adjust ( sk ) ;
/* Clean up data we have read: This will do ACK frames. */
if ( left ! = len )
tcp_cleanup_rbuf ( sk , len - left ) ;
}
EXPORT_SYMBOL ( tcp_read_done ) ;
2016-08-29 00:43:18 +03:00
int tcp_peek_len ( struct socket * sock )
{
return tcp_inq ( sock - > sk ) ;
}
EXPORT_SYMBOL ( tcp_peek_len ) ;
2018-04-16 20:33:35 +03:00
/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
int tcp_set_rcvlowat ( struct sock * sk , int val )
{
tcp: get rid of sysctl_tcp_adv_win_scale
With modern NIC drivers shifting to full page allocations per
received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to translate
a memory use into an expected payload (RWIN), in RX path.
tcp_win_from_space() implementation is limited to a few cases.
For hosts dealing with various MSS, we either underestimate
or overestimate the RWIN we send to the remote peers.
For instance with the default sysctl_tcp_adv_win_scale value,
we expect to store 50% of payload per allocated chunk of memory.
For the typical use of MTU=1500 traffic, and order-0 pages allocations
by NIC drivers, we are sending too big RWIN, leading to potential
tcp collapse operations, which are extremely expensive and source
of latency spikes.
This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
uses a per socket scaling factor, so that we can precisely
adjust the RWIN based on effective skb->len/skb->truesize ratio.
This patch alone can double TCP receive performance when receivers
are too slow to drain their receive queue, or by allowing
a bigger RWIN when MSS is close to PAGE_SIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-17 18:29:17 +03:00
int space , cap ;
2018-06-09 05:47:10 +03:00
if ( sk - > sk_userlocks & SOCK_RCVBUF_LOCK )
cap = sk - > sk_rcvbuf > > 1 ;
else
2022-07-22 21:22:00 +03:00
cap = READ_ONCE ( sock_net ( sk ) - > ipv4 . sysctl_tcp_rmem [ 2 ] ) > > 1 ;
2018-06-09 05:47:10 +03:00
val = min ( val , cap ) ;
2019-10-10 01:32:35 +03:00
WRITE_ONCE ( sk - > sk_rcvlowat , val ? : 1 ) ;
tcp: avoid extra wakeups for SO_RCVLOWAT users
SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
generated when enough bytes are available in receive queue, after
David's change (commit c7004482e8dc "tcp: Respect SO_RCVLOWAT in tcp_poll().")
But TCP still calls sk->sk_data_ready() for each chunk added to the receive
queue, meaning the thread is awakened, and goes back to sleep shortly after.
Tested:
tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB
-> Should get ~2 wakeups (c-switches) per MB, regardless of how many
(tiny or big) packets were received.
High speed (mostly full size GRO packets)
received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches
received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches
Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)
received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 20:33:37 +03:00
/* Check if we need to signal EPOLLIN right now */
tcp_data_ready ( sk ) ;
2018-04-16 20:33:35 +03:00
if ( sk - > sk_userlocks & SOCK_RCVBUF_LOCK )
return 0 ;
2023-07-17 18:29:17 +03:00
space = tcp_space_from_win ( sk , val ) ;
if ( space > sk - > sk_rcvbuf ) {
WRITE_ONCE ( sk - > sk_rcvbuf , space ) ;
tcp_sk ( sk ) - > window_clamp = val ;
2018-04-16 20:33:35 +03:00
}
return 0 ;
}
EXPORT_SYMBOL ( tcp_set_rcvlowat ) ;
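The clamping at the top of tcp_set_rcvlowat() can be read as a pure function of four inputs. The sketch below is a user-space restatement under that assumption; `clamp_rcvlowat()` and its parameter names are hypothetical, and the later receive-buffer growth step (tcp_space_from_win() and the window_clamp update) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the first steps of tcp_set_rcvlowat():
 *   - the cap is half of the locked receive buffer if the user locked
 *     it (SOCK_RCVBUF_LOCK), otherwise half of the tcp_rmem[2] maximum;
 *   - the requested value is limited to that cap;
 *   - a result of zero is stored as 1.
 */
static int clamp_rcvlowat(int val, int rcvbuf, int rmem_max,
			  bool rcvbuf_locked)
{
	int cap;

	assert(rcvbuf >= 0 && rmem_max >= 0);
	cap = (rcvbuf_locked ? rcvbuf : rmem_max) >> 1;
	if (val > cap)
		val = cap;
	return val ? val : 1;
}
```

With a 128 KB locked receive buffer, a 512 KB request is reduced to 64 KB; with an unlocked buffer and a 6 MB tcp_rmem[2], the same request passes through unchanged.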
2021-06-04 02:24:31 +03:00
void tcp_update_recv_tstamps ( struct sk_buff * skb ,
struct scm_timestamping_internal * tss )
2021-01-21 03:41:48 +03:00
{
if ( skb - > tstamp )
tss - > ts [ 0 ] = ktime_to_timespec64 ( skb - > tstamp ) ;
else
tss - > ts [ 0 ] = ( struct timespec64 ) { 0 } ;
if ( skb_hwtstamps ( skb ) - > hwtstamp )
tss - > ts [ 2 ] = ktime_to_timespec64 ( skb_hwtstamps ( skb ) - > hwtstamp ) ;
else
tss - > ts [ 2 ] = ( struct timespec64 ) { 0 } ;
}
tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.
Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.
1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
This operation does not involve any TCP locking.
2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
the transfer of pages from skbs to one VMA.
This operation only uses down_read(&current->mm->mmap_sem) after
holding TCP lock, thus solving the lockdep issue.
This new implementation was suggested by Andy Lutomirski with great details.
Benefits are :
- Better scalability, in case multiple threads reuse VMAs
(without mmap()/munmap() calls) since mmap_sem won't be write locked.
- Better error recovery.
The previous mmap() model had to provide the expected size of the
mapping. If for some reason one part could not be mapped (partial MSS),
the whole operation had to be aborted.
With the tcp_zerocopy_receive struct, kernel can report how
many bytes were successfully mapped, and how many bytes should
be read to skip the problematic sequence.
- No more memory allocation to hold an array of page pointers.
16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
- skbs are freed while mmap_sem has been released
The following patch changes the tcp_mmap tool to demonstrate
one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)
Note that memcg might require additional changes.
Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 18:58:08 +03:00
# ifdef CONFIG_MMU
2023-07-24 21:54:02 +03:00
static const struct vm_operations_struct tcp_vm_ops = {
2018-04-27 18:58:08 +03:00
} ;
tcp: implement mmap() for zero copy receive
Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.
Implement mmap() system call so that applications can avoid
copying data without complex splice() games.
Note that a successful mmap( X bytes) on TCP socket is consuming
bytes, as if recvmsg() has been done. (tp->copied += X)
Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.
If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned, no bytes are consumed.
Application must fall back to recvmsg() to read the problematic sequence.
mmap() won't block, regardless of socket being in blocking or
non-blocking mode. If not enough bytes are in receive queue,
mmap() would return -EAGAIN, or -EIO if socket is in a state
where no other bytes can be added into receive queue.
An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
to efficiently use mmap()
On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.
Tested:
mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
Without mmap() (tcp_mmap -s)
received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
With mmap() on receiver (tcp_mmap -s -z)
received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 20:33:38 +03:00
int tcp_mmap ( struct file * file , struct socket * sock ,
struct vm_area_struct * vma )
{
2018-04-27 18:58:08 +03:00
if ( vma - > vm_flags & ( VM_WRITE | VM_EXEC ) )
return - EPERM ;
2023-01-26 22:37:49 +03:00
vm_flags_clear ( vma , VM_MAYWRITE | VM_MAYEXEC ) ;
2018-04-27 18:58:08 +03:00
2020-06-09 07:33:51 +03:00
/* Instruct vm_insert_page() to not mmap_read_lock(mm) */
2023-01-26 22:37:49 +03:00
vm_flags_set ( vma , VM_MIXEDMAP ) ;
2018-04-27 18:58:08 +03:00
vma - > vm_ops = & tcp_vm_ops ;
return 0 ;
}
EXPORT_SYMBOL ( tcp_mmap ) ;
2020-12-03 01:53:44 +03:00
static skb_frag_t * skb_advance_to_frag ( struct sk_buff * skb , u32 offset_skb ,
u32 * offset_frag )
{
skb_frag_t * frag ;
tcp: Fix uninitialized access in skb frags array for Rx 0cp.
TCP Receive zerocopy iterates through the SKB queue via
tcp_recv_skb(), acquiring a pointer to an SKB and an offset within
that SKB to read from. From there, it iterates the SKB frags array to
determine which offset to start remapping pages from.
However, this is built on the assumption that the offset read so far
within the SKB is smaller than the SKB length. If this assumption is
violated, we can attempt to read an invalid frags array element, which
would cause a fault.
tcp_recv_skb() can cause such an SKB to be returned when the TCP FIN
flag is set. Therefore, we must guard against this occurrence inside
skb_advance_to_frag().
One way that we can reproduce this error follows:
1) In a receiver program, call getsockopt(TCP_ZEROCOPY_RECEIVE) with:
char some_array[32 * 1024];
struct tcp_zerocopy_receive zc = {
.copybuf_address = (__u64) &some_array[0],
.copybuf_len = 32 * 1024,
};
2) In a sender program, after a TCP handshake, send the following
sequence of packets:
i) Seq = [X, X+4000]
ii) Seq = [X+4000, X+5000]
iii) Seq = [X+4000, X+5000], Flags = FIN | URG, urgptr=1000
(This can happen without URG, if we have a signal pending, but URG is
a convenient way to reproduce the behaviour).
In this case, the following event sequence will occur on the receiver:
tcp_zerocopy_receive():
-> receive_fallback_to_copy() // copybuf_len >= inq
-> tcp_recvmsg_locked() // reads 5000 bytes, then breaks due to URG
-> tcp_recv_skb() // yields skb with skb->len == offset
-> tcp_zerocopy_set_hint_for_skb()
-> skb_advance_to_frag() // will return a frags ptr. >= nr_frags
-> find_next_mappable_frag() // will dereference this bad frags ptr.
With this patch, skb_advance_to_frag() will no longer return an
invalid frags pointer, and will return NULL instead, fixing the issue.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 05255b823a61 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
Link: https://lore.kernel.org/r/20211111235215.2605384-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-12 02:52:15 +03:00
if ( unlikely ( offset_skb > = skb - > len ) )
return NULL ;
2020-12-03 01:53:44 +03:00
offset_skb - = skb_headlen ( skb ) ;
if ( ( int ) offset_skb < 0 | | skb_has_frag_list ( skb ) )
return NULL ;
frag = skb_shinfo ( skb ) - > frags ;
while ( offset_skb ) {
if ( skb_frag_size ( frag ) > offset_skb ) {
* offset_frag = offset_skb ;
return frag ;
}
offset_skb - = skb_frag_size ( frag ) ;
+ + frag ;
}
* offset_frag = 0 ;
return frag ;
}
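The guard described above can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct model_skb` and `model_advance_to_frag()` are hypothetical stand-ins that keep only the length bookkeeping, and the function returns a frag index (or -1) rather than a pointer. It assumes `len` equals `headlen` plus the sum of all frag sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb: a linear head plus an array of frag sizes. */
struct model_skb {
	uint32_t len;              /* headlen + sum of frag_size[0..nr_frags-1] */
	uint32_t headlen;          /* bytes in the linear area */
	size_t nr_frags;
	const uint32_t *frag_size;
};

/*
 * Return the index of the frag containing 'offset' and store the offset
 * within that frag in *offset_frag. Return -1 when the offset is at or
 * past the end of the data, or still inside the linear area.
 */
static int model_advance_to_frag(const struct model_skb *skb, uint32_t offset,
				 uint32_t *offset_frag)
{
	size_t i = 0;

	assert(skb && offset_frag);

	/* The guard: without it, offset == len walks one past the last frag. */
	if (offset >= skb->len)
		return -1;
	if (offset < skb->headlen)
		return -1;
	offset -= skb->headlen;

	while (offset) {
		if (skb->frag_size[i] > offset) {
			*offset_frag = offset;
			return (int)i;
		}
		offset -= skb->frag_size[i];
		++i;
	}
	*offset_frag = 0;
	return (int)i;
}
```

Because the first check guarantees that the offset is smaller than the total length, the loop always stops at an index below `nr_frags` in this model.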
2020-12-03 01:53:45 +03:00
static bool can_map_frag ( const skb_frag_t * frag )
{
return skb_frag_size ( frag ) = = PAGE_SIZE & & ! skb_frag_off ( frag ) ;
}
static int find_next_mappable_frag ( const skb_frag_t * frag ,
int remaining_in_skb )
{
int offset = 0 ;
if ( likely ( can_map_frag ( frag ) ) )
return 0 ;
while ( offset < remaining_in_skb & & ! can_map_frag ( frag ) ) {
offset + = skb_frag_size ( frag ) ;
+ + frag ;
}
return offset ;
}
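A minimal userspace sketch of the same scan, for illustration only. `struct model_frag`, `MODEL_PAGE_SIZE` and the `model_*` helpers are hypothetical names, and the page size of 4096 is an assumption. As in the code above, the caller is assumed to pass a `remaining` value that does not exceed the bytes held by the frags starting at `frag`.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical stand-in for skb_frag_t: size plus offset within its page. */
struct model_frag {
	uint32_t size;
	uint32_t page_off;
};

/* A frag is mappable only if it covers exactly one whole page. */
static int model_can_map_frag(const struct model_frag *frag)
{
	return frag->size == MODEL_PAGE_SIZE && frag->page_off == 0;
}

/* Bytes to skip (by copying) before reaching the next mappable frag. */
static int model_find_next_mappable_frag(const struct model_frag *frag,
					 int remaining)
{
	int offset = 0;

	assert(frag && remaining >= 0);

	if (model_can_map_frag(frag))
		return 0;

	while (offset < remaining && !model_can_map_frag(frag)) {
		offset += (int)frag->size;
		++frag;
	}
	return offset;
}
```

If no frag in range is mappable, the result equals the bytes scanned, so the caller would copy the rest of the data instead of mapping it.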
2020-12-03 01:53:48 +03:00
static void tcp_zerocopy_set_hint_for_skb ( struct sock * sk ,
struct tcp_zerocopy_receive * zc ,
struct sk_buff * skb , u32 offset )
{
u32 frag_offset , partial_frag_remainder = 0 ;
int mappable_offset ;
skb_frag_t * frag ;
/* worst case: skip to next skb. try to improve on this case below */
zc - > recv_skip_hint = skb - > len - offset ;
/* Find the frag containing this offset (and how far into that frag) */
frag = skb_advance_to_frag ( skb , offset , & frag_offset ) ;
if ( ! frag )
return ;
if ( frag_offset ) {
struct skb_shared_info * info = skb_shinfo ( skb ) ;
/* We read part of the last frag, must recvmsg() rest of skb. */
if ( frag = = & info - > frags [ info - > nr_frags - 1 ] )
return ;
/* Else, we must at least read the remainder in this frag. */
partial_frag_remainder = skb_frag_size ( frag ) - frag_offset ;
zc - > recv_skip_hint - = partial_frag_remainder ;
+ + frag ;
}
/* partial_frag_remainder: If part way through a frag, must read rest.
* mappable_offset : Bytes till next mappable frag , * not * counting bytes
* in partial_frag_remainder .
*/
mappable_offset = find_next_mappable_frag ( frag , zc - > recv_skip_hint ) ;
zc - > recv_skip_hint = mappable_offset + partial_frag_remainder ;
}
2020-12-03 01:53:47 +03:00
static int tcp_recvmsg_locked ( struct sock * sk , struct msghdr * msg , size_t len ,
net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().
Analogous to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags all the time and checking for MSG_DONTWAIT
where needed (to preserve the formerly separate no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-11 15:49:55 +03:00
int flags , struct scm_timestamping_internal * tss ,
2020-12-03 01:53:47 +03:00
int * cmsg_flags ) ;
static int receive_fallback_to_copy ( struct sock * sk ,
2021-01-21 03:41:48 +03:00
struct tcp_zerocopy_receive * zc , int inq ,
struct scm_timestamping_internal * tss )
2020-12-03 01:53:47 +03:00
{
unsigned long copy_address = ( unsigned long ) zc - > copybuf_address ;
struct msghdr msg = { } ;
struct iovec iov ;
2021-01-21 03:41:48 +03:00
int err ;
2020-12-03 01:53:47 +03:00
zc - > length = 0 ;
zc - > recv_skip_hint = 0 ;
if ( copy_address ! = zc - > copybuf_address )
return - EINVAL ;
2022-09-16 03:25:47 +03:00
err = import_single_range ( ITER_DEST , ( void __user * ) copy_address ,
2020-12-03 01:53:47 +03:00
inq , & iov , & msg . msg_iter ) ;
if ( err )
return err ;
2022-04-11 15:49:55 +03:00
err = tcp_recvmsg_locked ( sk , & msg , inq , MSG_DONTWAIT ,
2021-01-21 03:41:48 +03:00
tss , & zc - > msg_flags ) ;
2020-12-03 01:53:47 +03:00
if ( err < 0 )
return err ;
zc - > copybuf_len = err ;
2020-12-03 01:53:48 +03:00
if ( likely ( zc - > copybuf_len ) ) {
struct sk_buff * skb ;
u32 offset ;
skb = tcp_recv_skb ( sk , tcp_sk ( sk ) - > copied_seq , & offset ) ;
if ( skb )
tcp_zerocopy_set_hint_for_skb ( sk , zc , skb , offset ) ;
}
2020-12-03 01:53:47 +03:00
return 0 ;
}
2020-12-03 01:53:42 +03:00
static int tcp_copy_straggler_data ( struct tcp_zerocopy_receive * zc ,
struct sk_buff * skb , u32 copylen ,
u32 * offset , u32 * seq )
{
unsigned long copy_address = ( unsigned long ) zc - > copybuf_address ;
struct msghdr msg = { } ;
struct iovec iov ;
int err ;
if ( copy_address ! = zc - > copybuf_address )
return - EINVAL ;
2022-09-16 03:25:47 +03:00
err = import_single_range ( ITER_DEST , ( void __user * ) copy_address ,
2020-12-03 01:53:42 +03:00
copylen , & iov , & msg . msg_iter ) ;
if ( err )
return err ;
err = skb_copy_datagram_msg ( skb , * offset , & msg , copylen ) ;
if ( err )
return err ;
zc - > recv_skip_hint - = copylen ;
* offset + = copylen ;
* seq + = copylen ;
return ( __s32 ) copylen ;
}
2021-01-21 03:41:48 +03:00
static int tcp_zc_handle_leftover ( struct tcp_zerocopy_receive * zc ,
struct sock * sk ,
struct sk_buff * skb ,
u32 * seq ,
s32 copybuf_len ,
struct scm_timestamping_internal * tss )
2020-12-03 01:53:42 +03:00
{
u32 offset , copylen = min_t ( u32 , copybuf_len , zc - > recv_skip_hint ) ;
if ( ! copylen )
return 0 ;
/* skb is null if inq < PAGE_SIZE. */
2021-01-21 03:41:48 +03:00
if ( skb ) {
2020-12-03 01:53:42 +03:00
offset = * seq - TCP_SKB_CB ( skb ) - > seq ;
2021-01-21 03:41:48 +03:00
} else {
2020-12-03 01:53:42 +03:00
skb = tcp_recv_skb ( sk , * seq , & offset ) ;
2021-01-21 03:41:48 +03:00
if ( TCP_SKB_CB ( skb ) - > has_rxtstamp ) {
tcp_update_recv_tstamps ( skb , tss ) ;
zc - > msg_flags | = TCP_CMSG_TS ;
}
}
2020-12-03 01:53:42 +03:00
zc - > copybuf_len = tcp_copy_straggler_data ( zc , skb , copylen , & offset ,
seq ) ;
return zc - > copybuf_len < 0 ? 0 : copylen ;
}
net-zerocopy: Defer vm zap unless actually needed.
Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.
zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().
This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range as before. Also, if the flag is set but a zap is
still required, the kernel performs that zap as necessary. Thus
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. This patch also increases the batch size for
vm_insert_pages and prefetches the page struct for the batch since
we're about to bump the refcount.
An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.
When using RPC-style traffic with medium sized (tens of KB) RPCs, this
change yields an efficiency improvement of about 30% for QPS/CPU usage.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 01:53:49 +03:00
static int tcp_zerocopy_vm_insert_batch_error ( struct vm_area_struct * vma ,
struct page * * pending_pages ,
unsigned long pages_remaining ,
unsigned long * address ,
u32 * length ,
u32 * seq ,
struct tcp_zerocopy_receive * zc ,
u32 total_bytes_to_map ,
int err )
{
/* At least one page did not map. Try zapping if we skipped earlier. */
if ( err = = - EBUSY & &
zc - > flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT ) {
u32 maybe_zap_len ;
maybe_zap_len = total_bytes_to_map - /* All bytes to map */
* length + /* Mapped or pending */
( pages_remaining * PAGE_SIZE ) ; /* Failed map. */
2023-01-04 03:27:32 +03:00
zap_page_range_single ( vma , * address , maybe_zap_len , NULL ) ;
2020-12-03 01:53:49 +03:00
err = 0 ;
}
if ( ! err ) {
unsigned long leftover_pages = pages_remaining ;
int bytes_mapped ;
2023-01-04 03:27:32 +03:00
/* We called zap_page_range_single, try to reinsert. */
2020-12-03 01:53:49 +03:00
err = vm_insert_pages ( vma , * address ,
pending_pages ,
& pages_remaining ) ;
bytes_mapped = PAGE_SIZE * ( leftover_pages - pages_remaining ) ;
* seq + = bytes_mapped ;
* address + = bytes_mapped ;
}
if ( err ) {
/* Either we were unable to zap, OR we zapped, retried an
* insert , and still had an issue . Either way , pages_remaining
* is the number of pages we were unable to map , and we unroll
* some state we speculatively touched before .
*/
const int bytes_not_mapped = PAGE_SIZE * pages_remaining ;
* length - = bytes_not_mapped ;
zc - > recv_skip_hint + = bytes_not_mapped ;
}
return err ;
}
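The byte accounting in the error path can be checked with a small worked example. This is a userspace sketch under stated assumptions, not kernel code: `MODEL_PAGE_SIZE`, `struct model_zc_state` and the `model_*` functions are hypothetical, and `length` is taken to mean "bytes mapped or pending", as the inline comments above describe it.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical subset of the state touched by the error path. */
struct model_zc_state {
	uint32_t length;         /* bytes mapped or pending so far */
	uint32_t recv_skip_hint; /* bytes the caller should read by copying */
};

/* Range to zap: bytes not yet attempted plus the pages that failed to map. */
static uint32_t model_maybe_zap_len(uint32_t total_bytes_to_map,
				    uint32_t length,
				    unsigned long pages_remaining)
{
	assert(length <= total_bytes_to_map); /* precondition of this model */
	return total_bytes_to_map - length +
	       (uint32_t)(pages_remaining * MODEL_PAGE_SIZE);
}

/* Undo the speculative accounting for pages that never got mapped. */
static void model_rollback(struct model_zc_state *st,
			   unsigned long pages_remaining)
{
	uint32_t bytes_not_mapped = (uint32_t)(pages_remaining * MODEL_PAGE_SIZE);

	st->length -= bytes_not_mapped;
	st->recv_skip_hint += bytes_not_mapped;
}
```

For instance, with 16 pages to map in total, 8 pages already counted in `length`, and 3 of those failing to insert, the zap range in this model is 8 + 3 = 11 pages, or 45056 bytes.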
2020-06-08 04:54:41 +03:00
static int tcp_zerocopy_vm_insert_batch ( struct vm_area_struct * vma ,
struct page * * pages ,
2020-12-03 01:53:49 +03:00
unsigned int pages_to_map ,
unsigned long * address ,
u32 * length ,
2020-06-08 04:54:41 +03:00
u32 * seq ,
2020-12-03 01:53:49 +03:00
struct tcp_zerocopy_receive * zc ,
u32 total_bytes_to_map )
2020-06-08 04:54:41 +03:00
{
unsigned long pages_remaining = pages_to_map ;
2020-12-03 01:53:49 +03:00
unsigned int pages_mapped ;
unsigned int bytes_mapped ;
int err ;
2020-06-08 04:54:41 +03:00
2020-12-03 01:53:49 +03:00
err = vm_insert_pages ( vma , * address , pages , & pages_remaining ) ;
pages_mapped = pages_to_map - ( unsigned int ) pages_remaining ;
bytes_mapped = PAGE_SIZE * pages_mapped ;
2020-06-08 04:54:41 +03:00
/* Even if vm_insert_pages fails, it may have partially succeeded in
* mapping ( some but not all of the pages ) .
*/
* seq + = bytes_mapped ;
2020-12-03 01:53:49 +03:00
* address + = bytes_mapped ;
if ( likely ( ! err ) )
return 0 ;
/* Error: maybe zap and retry + rollback state for failed inserts. */
return tcp_zerocopy_vm_insert_batch_error ( vma , pages + pages_mapped ,
pages_remaining , address , length , seq , zc , total_bytes_to_map ,
err ) ;
2020-06-08 04:54:41 +03:00
}
2021-02-12 00:21:07 +03:00
# define TCP_VALID_ZC_MSG_FLAGS (TCP_CMSG_TS)
2021-01-21 03:41:48 +03:00
static void tcp_zc_finalize_rx_tstamp ( struct sock * sk ,
struct tcp_zerocopy_receive * zc ,
struct scm_timestamping_internal * tss )
{
unsigned long msg_control_addr ;
struct msghdr cmsg_dummy ;
msg_control_addr = ( unsigned long ) zc - > msg_control ;
2023-04-13 14:47:03 +03:00
cmsg_dummy . msg_control_user = ( void __user * ) msg_control_addr ;
2021-01-21 03:41:48 +03:00
cmsg_dummy . msg_controllen =
( __kernel_size_t ) zc - > msg_controllen ;
cmsg_dummy . msg_flags = in_compat_syscall ( )
? MSG_CMSG_COMPAT : 0 ;
2021-05-07 01:35:30 +03:00
cmsg_dummy . msg_control_is_user = true ;
2021-01-21 03:41:48 +03:00
zc - > msg_flags = 0 ;
if ( zc - > msg_control = = msg_control_addr & &
zc - > msg_controllen = = cmsg_dummy . msg_controllen ) {
tcp_recv_timestamp ( & cmsg_dummy , sk , tss ) ;
zc - > msg_control = ( __u64 )
2023-04-13 14:47:03 +03:00
( ( uintptr_t ) cmsg_dummy . msg_control_user ) ;
2021-01-21 03:41:48 +03:00
zc - > msg_controllen =
( __u64 ) cmsg_dummy . msg_controllen ;
zc - > msg_flags = ( __u32 ) cmsg_dummy . msg_flags ;
}
}
tcp: Use per-vma locking for receive zerocopy
Per-VMA locking allows us to lock a struct vm_area_struct without
taking the process-wide mmap lock in read mode.
Consider a process workload where the mmap lock is taken constantly in
write mode. In this scenario, all zerocopy receives are periodically
blocked during that period of time - though in principle, the memory
ranges being used by TCP are not touched by the operations that need
the mmap write lock. This results in performance degradation.
Now consider another workload where the mmap lock is never taken in
write mode, but there are many TCP connections using receive zerocopy
that are concurrently receiving. These connections all take the mmap
lock in read mode, but this does induce a lot of contention and atomic
ops for this process-wide lock. This results in additional CPU
overhead caused by contending on the cache line for this lock.
However, with per-vma locking, both of these problems can be avoided.
As a test, I ran an RPC-style request/response workload with 4KB
payloads and receive zerocopy enabled, with 100 simultaneous TCP
connections. I measured perf cycles within the
find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
without per-vma locking enabled.
When using process-wide mmap semaphore read locking, about 1% of
measured perf cycles were within this path. With per-VMA locking, this
value dropped to about 0.45%.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-16 22:34:27 +03:00
static struct vm_area_struct * find_tcp_vma ( struct mm_struct * mm ,
unsigned long address ,
bool * mmap_locked )
{
2023-07-24 21:54:02 +03:00
struct vm_area_struct * vma = lock_vma_under_rcu ( mm , address ) ;
2023-06-16 22:34:27 +03:00
if ( vma ) {
2023-07-24 21:54:02 +03:00
if ( vma - > vm_ops ! = & tcp_vm_ops ) {
2023-06-16 22:34:27 +03:00
vma_end_read ( vma ) ;
return NULL ;
}
* mmap_locked = false ;
return vma ;
}
mmap_read_lock ( mm ) ;
vma = vma_lookup ( mm , address ) ;
2023-07-24 21:54:02 +03:00
if ( ! vma | | vma - > vm_ops ! = & tcp_vm_ops ) {
2023-06-16 22:34:27 +03:00
mmap_read_unlock(mm);
return NULL;
}
*mmap_locked = true;
return vma;
}
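The lookup above tries the per-VMA read lock first and only falls back to the process-wide mmap read lock when that fails. Below is a minimal single-threaded userspace sketch of that control flow, not the kernel code. `struct region`, `struct space`, `find_tcp_region()` and `release_region()` are hypothetical stand-ins, and the counters only model lock ownership; they provide no real synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA. */
struct region {
	unsigned long start, end; /* covers [start, end) */
	bool is_tcp;              /* models vma->vm_ops == &tcp_vm_ops */
	bool writer_active;       /* models a writer that makes the fast path fail */
	int readers;              /* models per-region read-lock holders */
};

/* Hypothetical stand-in for the address space. */
struct space {
	struct region *regions;
	size_t nr;
	int global_readers;       /* models holders of the process-wide read lock */
};

static struct region *lookup(struct space *sp, unsigned long addr)
{
	for (size_t i = 0; i < sp->nr; i++)
		if (addr >= sp->regions[i].start && addr < sp->regions[i].end)
			return &sp->regions[i];
	return NULL;
}

static struct region *find_tcp_region(struct space *sp, unsigned long addr,
				      bool *global_locked)
{
	struct region *r = lookup(sp, addr);

	/* Fast path: lock only the region, leave the global lock alone. */
	if (r && !r->writer_active) {
		r->readers++;
		if (!r->is_tcp) {
			r->readers--;   /* wrong kind of region: unlock, give up */
			return NULL;
		}
		*global_locked = false;
		return r;
	}

	/* Slow path: take the process-wide read lock and look up again. */
	sp->global_readers++;
	r = lookup(sp, addr);
	if (!r || !r->is_tcp) {
		sp->global_readers--;
		return NULL;
	}
	*global_locked = true;
	return r;
}

/* The caller must undo whichever lock the lookup reported. */
static void release_region(struct space *sp, struct region *r, bool global_locked)
{
	assert(r != NULL);
	if (global_locked)
		sp->global_readers--;
	else
		r->readers--;
}
```

The boolean out-parameter plays the role of `mmap_locked`: it tells the caller which of the two locks to drop once the pages have been inserted.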
net-zerocopy: Defer vm zap unless actually needed.
Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.
zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().
This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range as before. Also, if the flag is set but a zap is
still required, the kernel performs that zap as necessary. Thus
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. It also increases the batch size for
vm_insert_pages and prefetches the page struct for the batch since
we're about to bump the refcount.
An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.
When using RPC-style traffic with medium-sized (tens of KB) RPCs, this
change yields an efficiency improvement of about 30% for QPS/CPU usage.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
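The hint semantics described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct fake_region`, `zap()`, `insert_page()`, `map_pages()` and the flag name `HINT_NO_ZAP` are all hypothetical, and the array of integers merely stands in for page-table entries. It shows that a wrong hint costs a late zap but never produces a wrong mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_SLOTS 8     /* hypothetical region size, in pages */
#define HINT_NO_ZAP  0x1u  /* hypothetical name for the user-supplied hint */

struct fake_region {
	int slot[SKETCH_SLOTS]; /* 0 means "nothing mapped here" */
	int zap_calls;          /* how many zaps were paid for */
};

/* Models zapping: unmap n slots starting at first. */
static void zap(struct fake_region *r, size_t first, size_t n)
{
	assert(first + n <= SKETCH_SLOTS);
	for (size_t i = first; i < first + n; i++)
		r->slot[i] = 0;
	r->zap_calls++;
}

/* Models page insertion: refuses to overwrite an existing mapping. */
static int insert_page(struct fake_region *r, size_t idx, int page)
{
	if (r->slot[idx])
		return -EBUSY;
	r->slot[idx] = page;
	return 0;
}

/* Map n pages at the start of the region, honoring the hint. */
static int map_pages(struct fake_region *r, const int *pages, size_t n,
		     unsigned int flags)
{
	if (n > SKETCH_SLOTS)
		return -EINVAL;

	/* No hint (or an old application): zap up front, as before. */
	if (!(flags & HINT_NO_ZAP))
		zap(r, 0, n);

	for (size_t i = 0; i < n; i++) {
		int err = insert_page(r, i, pages[i]);

		/* The hint was wrong: zap the rest now and retry once. */
		if (err == -EBUSY) {
			zap(r, i, n - i);
			err = insert_page(r, i, pages[i]);
		}
		if (err)
			return err;
	}
	return 0;
}
```

With the hint set and a clean region, no zap happens at all. With the hint set and stale mappings present, the model pays for exactly one deferred zap.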
2020-12-03 01:53:49 +03:00
#define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.
Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.
1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
This operation does not involve any TCP locking.
2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
the transfer of pages from skbs to one VMA.
This operation only uses down_read(&current->mm->mmap_sem) after
holding TCP lock, thus solving the lockdep issue.
This new implementation was suggested by Andy Lutomirski in great detail.
Benefits are :
- Better scalability, in case multiple threads reuse VMAs
(without mmap()/munmap() calls) since mmap_sem won't be write locked.
- Better error recovery.
The previous mmap() model had to provide the expected size of the
mapping. If for some reason one part could not be mapped (partial MSS),
the whole operation had to be aborted.
With the tcp_zerocopy_receive struct, kernel can report how
many bytes were successfully mapped, and how many bytes should
be read to skip the problematic sequence.
- No more memory allocation to hold an array of page pointers.
16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
- skbs are freed while mmap_sem has been released
The following patch changes the tcp_mmap tool to demonstrate
one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)
Note that memcg might require additional changes.
Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
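From userspace, the two phases described in this commit message might look roughly like the sketch below. It assumes a Linux system whose `<linux/tcp.h>` already defines `TCP_ZEROCOPY_RECEIVE` and `struct tcp_zerocopy_receive`. The helper names `zc_reserve()` and `zc_receive()` are hypothetical, and error handling is reduced to return values.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

/* Phase 1: reserve address space on the socket; no TCP locking involved.
 * Returns NULL on failure. */
static void *zc_reserve(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Phase 2: ask the kernel to move up to len bytes of received pages into
 * the reserved range. Returns 0 on success, -1 with errno set on failure. */
static int zc_receive(int fd, void *addr, uint32_t len,
		      struct tcp_zerocopy_receive *zc)
{
	socklen_t zc_len = sizeof(*zc);

	assert(zc != NULL);
	memset(zc, 0, sizeof(*zc));
	zc->address = (uint64_t)(uintptr_t)addr; /* must be page aligned */
	zc->length = len;

	return getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, zc, &zc_len);
}
```

After a successful call, `zc->length` reports how many bytes were mapped and `zc->recv_skip_hint` how many bytes the application should read with an ordinary `recv()` before trying again.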
2018-04-27 18:58:08 +03:00
static int tcp_zerocopy_receive(struct sock *sk,
2021-01-21 03:41:48 +03:00
struct tcp_zerocopy_receive *zc,
struct scm_timestamping_internal *tss)
2018-04-27 18:58:08 +03:00
{
2020-12-03 01:53:49 +03:00
u32 length = 0, offset, vma_len, avail_len, copylen = 0;
2018-04-27 18:58:08 +03:00
unsigned long address = (unsigned long)zc->address;
2020-12-03 01:53:49 +03:00
struct page *pages[TCP_ZEROCOPY_PAGE_BATCH_SIZE];
2020-12-03 01:53:42 +03:00
s32 copybuf_len = zc->copybuf_len;
struct tcp_sock *tp = tcp_sk(sk);
2018-04-27 18:58:08 +03:00
const skb_frag_t *frags = NULL;
2020-12-03 01:53:49 +03:00
unsigned int pages_to_map = 0;
2018-04-27 18:58:08 +03:00
struct vm_area_struct *vma;
struct sk_buff *skb = NULL;
2020-12-03 01:53:42 +03:00
u32 seq = tp->copied_seq;
2020-12-03 01:53:49 +03:00
u32 total_bytes_to_map;
2020-12-03 01:53:42 +03:00
int inq = tcp_inq(sk);
2023-06-16 22:34:27 +03:00
bool mmap_locked;
tcp: implement mmap() for zero copy receive
Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.
Implement mmap() system call so that applications can avoid
copying data without complex splice() games.
Note that a successful mmap( X bytes) on TCP socket is consuming
bytes, as if recvmsg() has been done. (tp->copied += X)
Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.
If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned, no bytes are consumed.
Application must fall back to recvmsg() to read the problematic sequence.
mmap() won't block, regardless of socket being in blocking or
non-blocking mode. If not enough bytes are in receive queue,
mmap() would return -EAGAIN, or -EIO if socket is in a state
where no other bytes can be added into receive queue.
An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
to efficiently use mmap()
On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.
Tested:
mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
Without mmap() (tcp_mmap -s)
received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
With mmap() on receiver (tcp_mmap -s -z)
received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 20:33:38 +03:00
int ret;
2020-12-03 01:53:42 +03:00
zc->copybuf_len = 0;
2021-01-21 03:41:48 +03:00
zc->msg_flags = 0;
2020-12-03 01:53:42 +03:00
2018-04-27 18:58:08 +03:00
if (address & (PAGE_SIZE - 1) || address != zc->address)
2018-04-16 20:33:38 +03:00
return -EINVAL;
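The check above rejects addresses that are not page aligned or that do not survive the conversion to `unsigned long`. A minimal userspace sketch of the same test follows, assuming 4 KB pages and assuming the user-supplied address field is 64 bits wide. `SKETCH_PAGE_SIZE` and `zc_address_ok()` are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KB pages */

/* Returns true when a 64-bit user-supplied address is usable as a
 * page-aligned native pointer value. */
static bool zc_address_ok(uint64_t user_addr)
{
	uintptr_t native = (uintptr_t)user_addr; /* may truncate on 32-bit */

	if (native & (SKETCH_PAGE_SIZE - 1))
		return false; /* not page aligned */
	if ((uint64_t)native != user_addr)
		return false; /* high bits were lost in the conversion */
	return true;
}
```

The second comparison only matters where the native word is narrower than 64 bits; on a 64-bit build the conversion is lossless and the alignment test decides the outcome alone.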
if (sk->sk_state == TCP_LISTEN)
2018-04-27 18:58:08 +03:00
return - ENOTCONN ;
tcp: implement mmap() for zero copy receive
Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.
Implement mmap() system call so that applications can avoid
copying data without complex splice() games.
Note that a successful mmap(X bytes) on a TCP socket consumes X
bytes, as if recvmsg() had been called. (tp->copied += X)
Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.
If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned and no bytes are consumed.
The application must fall back to recvmsg() to read the problematic sequence.
mmap() won't block, regardless of the socket being in blocking or
non-blocking mode. If not enough bytes are in the receive queue,
mmap() returns -EAGAIN, or -EIO if the socket is in a state
where no more bytes can be added to the receive queue.
An application might use SO_RCVLOWAT, poll() and/or ioctl(FIONREAD)
to use mmap() efficiently.
On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.
Tested:
mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
Without mmap() (tcp_mmap -s)
received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
With mmap() on receiver (tcp_mmap -s -z)
received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
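The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and assumes that "usec per MB" is total CPU time (user + sys) divided by megabytes received, and that 1 MB is 2^20 bytes; both assumptions reproduce the listed numbers but are inferred from the output, not stated by the tool.

```c
/* MTU needed so that each segment carries exactly one page of payload. */
unsigned int mtu_for_payload(unsigned int payload, unsigned int ip_hdr,
			     unsigned int tcp_hdr)
{
	return payload + ip_hdr + tcp_hdr;
}

/* CPU cost in microseconds per MB: (user + sys seconds) / MB received. */
double usec_per_mb(double user_s, double sys_s, double mb)
{
	return (user_s + sys_s) * 1e6 / mb;
}

/* Throughput in Gbit/s, counting 1 MB as 2^20 bytes. */
double gbit_per_s(double mb, double seconds)
{
	return mb * 1048576.0 * 8.0 / seconds / 1e9;
}
```

For the first run of each mode this gives roughly 116.3 usec per MB without mmap() and 45.5 usec per MB with it, so the CPU cost per MB drops to well under half.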
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 20:33:38 +03:00
sock_rps_record_flow ( sk ) ;
2020-12-03 01:53:47 +03:00
if ( inq & & inq < = copybuf_len )
2021-01-21 03:41:48 +03:00
return receive_fallback_to_copy ( sk , zc , inq , tss ) ;
2020-12-03 01:53:47 +03:00
2020-12-03 01:53:46 +03:00
if ( inq < PAGE_SIZE ) {
zc - > length = 0 ;
zc - > recv_skip_hint = inq ;
if ( ! inq & & sock_flag ( sk , SOCK_DONE ) )
return - EIO ;
return 0 ;
}
tcp: Use per-vma locking for receive zerocopy
Per-VMA locking allows us to lock a struct vm_area_struct without
taking the process-wide mmap lock in read mode.
Consider a process workload where the mmap lock is taken constantly in
write mode. In this scenario, all zerocopy receives are periodically
blocked during that period of time - though in principle, the memory
ranges being used by TCP are not touched by the operations that need
the mmap write lock. This results in performance degradation.
Now consider another workload where the mmap lock is never taken in
write mode, but there are many TCP connections using receive zerocopy
that are concurrently receiving. These connections all take the mmap
lock in read mode, but this does induce a lot of contention and atomic
ops for this process-wide lock. This results in additional CPU
overhead caused by contending on the cache line for this lock.
However, with per-vma locking, both of these problems can be avoided.
As a test, I ran an RPC-style request/response workload with 4KB
payloads and receive zerocopy enabled, with 100 simultaneous TCP
connections. I measured perf cycles within the
find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
without per-vma locking enabled.
When using process-wide mmap semaphore read locking, about 1% of
measured perf cycles were within this path. With per-VMA locking, this
value dropped to about 0.45%.
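The lookup pattern this message describes (take a fine-grained lock when possible, otherwise fall back to the process-wide one and tell the caller which was taken) can be modeled in user space. The sketch below is a single-threaded toy model with counters instead of real locks; struct region, struct space, find_region() and release_region() are hypothetical names and this is not the kernel's find_tcp_vma().

```c
#include <stdbool.h>
#include <stddef.h>

struct region {
	unsigned long start, end; /* covers [start, end) */
	bool contended;           /* models a per-region lock that cannot be taken now */
	int read_locks;           /* per-region read holders */
};

struct space {
	struct region *regions;
	size_t count;
	int global_read_locks;    /* process-wide read holders */
};

static struct region *lookup(struct space *sp, unsigned long addr)
{
	for (size_t i = 0; i < sp->count; i++)
		if (addr >= sp->regions[i].start && addr < sp->regions[i].end)
			return &sp->regions[i];
	return NULL;
}

/* Fast path: lock only the region. Slow path: take the global lock.
 * *global_locked tells the caller which lock to drop afterwards.
 * Returns NULL with no lock held if no region covers addr. */
struct region *find_region(struct space *sp, unsigned long addr,
			   bool *global_locked)
{
	struct region *r = lookup(sp, addr);

	if (r && !r->contended) {
		r->read_locks++;
		*global_locked = false;
		return r;
	}

	sp->global_read_locks++;
	r = lookup(sp, addr);
	if (!r) {
		sp->global_read_locks--;
		return NULL;
	}
	*global_locked = true;
	return r;
}

/* Drop whichever lock find_region() took. */
void release_region(struct space *sp, struct region *r, bool global_locked)
{
	if (global_locked)
		sp->global_read_locks--;
	else
		r->read_locks--;
}
```

In the uncontended case the global counter is never touched, which is the property the message credits for the drop from about 1% to about 0.45% of measured cycles.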
Signed-off-by: Arjun Roy <arjunroy@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-16 22:34:27 +03:00
vma = find_tcp_vma ( current - > mm , address , & mmap_locked ) ;
if ( ! vma )
2020-05-14 23:58:13 +03:00
return - EINVAL ;
2023-06-16 22:34:27 +03:00
2020-12-03 01:53:42 +03:00
vma_len = min_t ( unsigned long , zc - > length , vma - > vm_end - address ) ;
avail_len = min_t ( u32 , vma_len , inq ) ;
net-zerocopy: Defer vm zap unless actually needed.
Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.
zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().
This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range as before. Also, if the flag is set but a zap is
still required, the kernel performs that zap as necessary. Thus
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. This patch also increases the batch size for
vm_insert_pages and prefetches the page struct for the batch since
we're about to bump the refcount.
An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.
When using RPC-style traffic with medium sized (tens of KB) RPCs, this
change yields an efficiency improvement of about 30% for QPS/CPU usage.
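The up-front decision this message describes (round the mappable length down to whole pages, and zap first unless the caller hinted that the range is already clean) can be sketched as a small user-space model. The names below are hypothetical: MODEL_PAGE_SIZE assumes 4 KB pages, MODEL_FLAG_TLB_CLEAN_HINT stands in for TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT, and the later fallback zap that the kernel performs when the hint turns out to be wrong is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u            /* assumption: 4 KB pages */
#define MODEL_FLAG_TLB_CLEAN_HINT 0x1u   /* stand-in for the real flag */

struct map_plan {
	uint32_t length;          /* bytes to map, or bytes reported back */
	uint32_t recv_skip_hint;  /* bytes the caller should copy instead */
	bool zap_first;           /* zap the range before inserting pages */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Model of the length/zap decision made before pages are inserted. */
struct map_plan plan_mapping(uint32_t requested, uint32_t vma_remaining,
			     uint32_t inq, uint32_t flags)
{
	struct map_plan plan = { 0 };
	uint32_t avail = min_u32(min_u32(requested, vma_remaining), inq);
	uint32_t whole_pages = avail & ~(MODEL_PAGE_SIZE - 1);

	if (whole_pages) {
		/* Without the hint, behave like older applications expect. */
		plan.zap_first = !(flags & MODEL_FLAG_TLB_CLEAN_HINT);
		plan.length = whole_pages;
		plan.recv_skip_hint = 0;
	} else {
		/* Less than one page can be mapped: hand it to the copy path. */
		plan.length = avail;
		plan.recv_skip_hint = avail;
	}
	return plan;
}
```

With 10000 bytes queued and a large enough area, the model maps 8192 bytes and zaps first only when the hint flag is absent.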
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 01:53:49 +03:00
total_bytes_to_map = avail_len & ~ ( PAGE_SIZE - 1 ) ;
if ( total_bytes_to_map ) {
if ( ! ( zc - > flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT ) )
2023-01-04 03:27:32 +03:00
zap_page_range_single ( vma , address , total_bytes_to_map ,
NULL ) ;
2020-12-03 01:53:49 +03:00
zc - > length = total_bytes_to_map ;
2018-09-26 23:57:03 +03:00
zc - > recv_skip_hint = 0 ;
} else {
2020-12-03 01:53:42 +03:00
zc - > length = avail_len ;
zc - > recv_skip_hint = avail_len ;
2018-09-26 23:57:03 +03:00
}
2018-04-27 18:58:08 +03:00
ret = 0 ;
while ( length + PAGE_SIZE < = zc - > length ) {
2020-12-03 01:53:45 +03:00
int mappable_offset ;
2020-12-03 01:53:49 +03:00
struct page * page ;
2020-12-03 01:53:45 +03:00
2018-04-27 18:58:08 +03:00
if ( zc - > recv_skip_hint < PAGE_SIZE ) {
2020-12-03 01:53:44 +03:00
u32 offset_frag ;
2018-04-27 18:58:08 +03:00
if ( skb ) {
2019-12-15 22:54:51 +03:00
if ( zc - > recv_skip_hint > 0 )
break ;
2018-04-27 18:58:08 +03:00
skb = skb - > next ;
offset = seq - TCP_SKB_CB ( skb ) - > seq ;
} else {
skb = tcp_recv_skb ( sk , seq , & offset ) ;
}
2021-01-21 03:41:48 +03:00
if ( TCP_SKB_CB ( skb ) - > has_rxtstamp ) {
tcp_update_recv_tstamps ( skb , tss ) ;
zc - > msg_flags | = TCP_CMSG_TS ;
}
2018-04-27 18:58:08 +03:00
zc - > recv_skip_hint = skb - > len - offset ;
2020-12-03 01:53:44 +03:00
frags = skb_advance_to_frag ( skb , offset , & offset_frag ) ;
if ( ! frags | | offset_frag )
2018-04-27 18:58:08 +03:00
break ;
2018-04-16 20:33:38 +03:00
}
2018-09-26 23:57:04 +03:00
2020-12-03 01:53:45 +03:00
mappable_offset = find_next_mappable_frag ( frags ,
zc - > recv_skip_hint ) ;
if ( mappable_offset ) {
zc - > recv_skip_hint = mappable_offset ;
2018-04-27 18:58:08 +03:00
break ;
2018-09-26 23:57:04 +03:00
}
2020-12-03 01:53:49 +03:00
page = skb_frag_page(frags);
prefetchw(page);
pages[pages_to_map++] = page;
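The zap-deferral policy described in the commit message above can be modeled in a few lines of plain C. This is only a toy, single-threaded sketch: `struct region`, `insert_page()`, `zap()` and `map_with_hint()` are hypothetical names invented for illustration, not kernel functions, and the `-EBUSY` return merely stands in for an insert that fails because something is already mapped.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reusable receive region. */
struct region {
	bool mapped;	/* does the region currently hold a mapping? */
	int zaps;	/* number of zap operations performed so far */
};

/* Model of an insert that refuses to overwrite an existing mapping. */
static int insert_page(struct region *r)
{
	if (r->mapped)
		return -EBUSY;
	r->mapped = true;
	return 0;
}

/* Model of a zap: drop whatever was mapped and count the operation. */
static void zap(struct region *r)
{
	r->mapped = false;
	r->zaps++;
}

/*
 * Without the hint, zap up front, as older applications expect.
 * With the hint, try the insert first and zap only if it fails,
 * so a wrong hint costs a retry but never correctness.
 */
static int map_with_hint(struct region *r, bool clean_hint)
{
	int ret;

	if (!clean_hint)
		zap(r);
	ret = insert_page(r);
	if (ret == -EBUSY) {
		zap(r);
		ret = insert_page(r);
	}
	return ret;
}
```

In this model a caller that sets the hint on a clean region pays for no zap at all, while a caller that sets it wrongly still ends up with a valid mapping after one deferred zap.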
tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.
Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.
1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
This operation does not involve any TCP locking.
2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
the transfer of pages from skbs to one VMA.
This operation only uses down_read(&current->mm->mmap_sem) after
holding TCP lock, thus solving the lockdep issue.
This new implementation was suggested by Andy Lutomirski with great details.
Benefits are :
- Better scalability, in case multiple threads reuse VMAs
(without mmap()/munmap() calls) since mmap_sem won't be write locked.
- Better error recovery.
The previous mmap() model had to provide the expected size of the
mapping. If for some reason one part could not be mapped (partial MSS),
the whole operation had to be aborted.
With the tcp_zerocopy_receive struct, kernel can report how
many bytes were successfully mapped, and how many bytes should
be read to skip the problematic sequence.
- No more memory allocation to hold an array of page pointers.
16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
- skbs are freed while mmap_sem has been released
Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)
Note that memcg might require additional changes.
Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 18:58:08 +03:00
length += PAGE_SIZE;
zc->recv_skip_hint -= PAGE_SIZE;
frags++;
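The two-phase scheme from the TCP_ZEROCOPY_RECEIVE commit message above might look roughly like this from userspace. It is a minimal sketch, assuming Linux UAPI headers that provide `struct tcp_zerocopy_receive` and `TCP_ZEROCOPY_RECEIVE`; the helper names `zc_reserve()` and `zc_receive()` are hypothetical, and the error handling is deliberately thin.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

/* Phase 1: reserve address space on the socket; no TCP locking involved. */
static void *zc_reserve(int fd, size_t len)
{
	void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	return addr == MAP_FAILED ? NULL : addr;
}

/*
 * Phase 2: ask the kernel to map received pages into the reserved range.
 * On success, *mapped is the number of bytes now readable at addr and
 * *to_copy is the number of bytes that must be read the ordinary way.
 * Returns 0 on success or a negative errno value.
 */
static int zc_receive(int fd, void *addr, size_t len,
		      size_t *mapped, size_t *to_copy)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)addr;
	zc.length = (__u32)len;
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len) == -1)
		return -errno;
	*mapped = zc.length;
	*to_copy = zc.recv_skip_hint;
	return 0;
}
```

A caller would typically reserve the range once per connection, call `zc_receive()` whenever the socket becomes readable, and fall back to `recv()` for the `to_copy` bytes that could not be mapped.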
2020-12-03 01:53:49 +03:00
if (pages_to_map == TCP_ZEROCOPY_PAGE_BATCH_SIZE ||
zc->recv_skip_hint < PAGE_SIZE) {
/* Either full batch, or we're about to go to next skb
* (and we cannot unroll failed ops across skbs).
*/
ret = tcp_zerocopy_vm_insert_batch(vma, pages,
pages_to_map,
&address, &length,
&seq, zc,
total_bytes_to_map);
2020-06-08 04:54:41 +03:00
if (ret)
goto out;
2020-12-03 01:53:49 +03:00
pages_to_map = 0;
2020-06-08 04:54:41 +03:00
}
}
2020-12-03 01:53:49 +03:00
if (pages_to_map) {
ret = tcp_zerocopy_vm_insert_batch(vma, pages, pages_to_map,
&address, &length, &seq,
zc, total_bytes_to_map);
tcp: implement mmap() for zero copy receive
Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.
Implement mmap() system call so that applications can avoid
copying data without complex splice() games.
Note that a successful mmap( X bytes) on TCP socket is consuming
bytes, as if recvmsg() has been done. (tp->copied += X)
Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.
If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned, no bytes are consumed.
Application must fall back to recvmsg() to read the problematic sequence.
mmap() won't block, regardless of socket being in blocking or
non-blocking mode. If not enough bytes are in receive queue,
mmap() would return -EAGAIN, or -EIO if socket is in a state
where no other bytes can be added into receive queue.
An application might use SO_RCVLOWAT, poll() and/or ioctl(FIONREAD)
to efficiently use mmap().
On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.
Tested:
mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
Without mmap() (tcp_mmap -s)
received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
With mmap() on receiver (tcp_mmap -s -z)
received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
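The MTU arithmetic quoted in the test setup above can be checked with a tiny worked example. This is only an illustrative sketch: the constant and function names are hypothetical, and the header sizes are simply the ones stated in the commit message.

```c
#include <assert.h>

enum {
	PAGE_PAYLOAD = 4096,	/* one 4 KB page of TCP payload */
	IPV6_HDR_LEN = 40,	/* IPv6 header size from the test setup */
	TCP_HDR_LEN  = 32,	/* TCP header size from the test setup */
};

/* MTU needed so that each full segment carries exactly one page. */
static int mtu_for_page_payload(void)
{
	return PAGE_PAYLOAD + IPV6_HDR_LEN + TCP_HDR_LEN;
}

/* A payload can be mapped page by page only if it is a whole number of pages. */
static int is_page_multiple(unsigned int payload_len)
{
	return payload_len % PAGE_PAYLOAD == 0;
}
```

With these numbers the MTU comes out at 4168, matching the value used in the benchmark; a payload that is not a multiple of 4096 bytes leaves a partial page that has to be read with recvmsg().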
2018-04-16 20:33:38 +03:00
}
out:
tcp: Use per-vma locking for receive zerocopy
Per-VMA locking allows us to lock a struct vm_area_struct without
taking the process-wide mmap lock in read mode.
Consider a process workload where the mmap lock is taken constantly in
write mode. In this scenario, all zerocopy receives are periodically
blocked during that period of time - though in principle, the memory
ranges being used by TCP are not touched by the operations that need
the mmap write lock. This results in performance degradation.
Now consider another workload where the mmap lock is never taken in
write mode, but there are many TCP connections using receive zerocopy
that are concurrently receiving. These connections all take the mmap
lock in read mode, but this does induce a lot of contention and atomic
ops for this process-wide lock. This results in additional CPU
overhead caused by contending on the cache line for this lock.
However, with per-vma locking, both of these problems can be avoided.
As a test, I ran an RPC-style request/response workload with 4KB
payloads and receive zerocopy enabled, with 100 simultaneous TCP
connections. I measured perf cycles within the
find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
without per-vma locking enabled.
When using process-wide mmap semaphore read locking, about 1% of
measured perf cycles were within this path. With per-VMA locking, this
value dropped to about 0.45%.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-16 22:34:27 +03:00
if (mmap_locked)
mmap_read_unlock(current->mm);
else
vma_end_read(vma);
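The unlock path above depends on remembering which lock was actually taken. A rough single-threaded model of that bookkeeping is sketched below; `struct model_mm`, `struct model_vma`, `lock_for_receive()` and `unlock_after_receive()` are hypothetical names, and plain counters stand in for real locks, so this shows only the pairing logic and not any actual synchronization.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical models: counters stand in for real reader locks. */
struct model_mm {
	int readers;		/* holders of the process-wide lock */
};

struct model_vma {
	bool read_lockable;	/* can the per-VMA lock be taken right now? */
	int readers;		/* holders of the per-VMA lock */
};

/*
 * Prefer the per-VMA lock; fall back to the process-wide lock.
 * Returns true if the process-wide lock had to be taken.
 */
static bool lock_for_receive(struct model_mm *mm, struct model_vma *vma)
{
	if (vma->read_lockable) {
		vma->readers++;
		return false;
	}
	mm->readers++;
	return true;
}

/* Release exactly the lock that lock_for_receive() acquired. */
static void unlock_after_receive(struct model_mm *mm, struct model_vma *vma,
				 bool mmap_locked)
{
	if (mmap_locked)
		mm->readers--;
	else
		vma->readers--;
}
```

Carrying the boolean from the lock call to the unlock call is what keeps the two paths balanced when the fast path is unavailable.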
2020-12-03 01:53:42 +03:00
/* Try to copy straggler data. */
if (!ret)
2021-01-21 03:41:48 +03:00
copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
2020-12-03 01:53:42 +03:00
if (length + copylen) {
2019-10-11 06:17:40 +03:00
WRITE_ONCE(tp->copied_seq, seq);
2018-04-27 18:58:08 +03:00
tcp_rcv_space_adjust(sk);
/* Clean up data we have read: This will do ACK frames. */
tcp_recv_skb(sk, seq, &offset);
2020-12-03 01:53:42 +03:00
tcp_cleanup_rbuf(sk, length + copylen);
2018-04-27 18:58:08 +03:00
ret = 0;
if (length == zc->length)
zc->recv_skip_hint = 0;
} else {
if (!zc->recv_skip_hint && sock_flag(sk, SOCK_DONE))
ret = -EIO;
}
zc->length = length;
2018-04-16 20:33:38 +03:00
return ret;
}
2018-04-27 18:58:08 +03:00
#endif
2018-04-16 20:33:38 +03:00
2017-08-23 00:08:48 +03:00
/* Similar to __sock_recv_timestamp, but does not require an skb */
2021-06-04 02:24:31 +03:00
void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk,
struct scm_timestamping_internal *tss)
2017-08-23 00:08:48 +03:00
{
2019-02-02 18:34:50 +03:00
int new_tstamp = sock_flag(sk, SOCK_TSTAMP_NEW);
2017-08-23 00:08:48 +03:00
bool has_timestamping = false;
if (tss->ts[0].tv_sec || tss->ts[0].tv_nsec) {
if (sock_flag(sk, SOCK_RCVTSTAMP)) {
if (sock_flag(sk, SOCK_RCVTSTAMPNS)) {
2019-02-02 18:34:50 +03:00
if (new_tstamp) {
2019-10-25 23:04:46 +03:00
struct __kernel_timespec kts = {
.tv_sec = tss->ts[0].tv_sec,
.tv_nsec = tss->ts[0].tv_nsec,
};
2019-02-02 18:34:50 +03:00
put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMPNS_NEW,
sizeof(kts), &kts);
} else {
2019-10-25 23:04:46 +03:00
struct __kernel_old_timespec ts_old = {
.tv_sec = tss->ts[0].tv_sec,
.tv_nsec = tss->ts[0].tv_nsec,
};
2019-02-02 18:34:50 +03:00
put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMPNS_OLD,
2019-02-02 18:34:51 +03:00
sizeof(ts_old), &ts_old);
2019-02-02 18:34:50 +03:00
}
2017-08-23 00:08:48 +03:00
} else {
2019-02-02 18:34:50 +03:00
if (new_tstamp) {
2019-10-25 23:04:46 +03:00
struct __kernel_sock_timeval stv = {
.tv_sec = tss->ts[0].tv_sec,
.tv_usec = tss->ts[0].tv_nsec / 1000,
};
2019-02-02 18:34:50 +03:00
put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_NEW,
sizeof(stv), &stv);
} else {
2019-10-25 23:04:46 +03:00
struct __kernel_old_timeval tv = {
.tv_sec = tss->ts[0].tv_sec,
.tv_usec = tss->ts[0].tv_nsec / 1000,
};
2019-02-02 18:34:50 +03:00
put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_OLD,
sizeof(tv), &tv);
}
2017-08-23 00:08:48 +03:00
}
}
2023-08-31 16:52:11 +03:00
if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_SOFTWARE)
2017-08-23 00:08:48 +03:00
has_timestamping = true;
else
2019-02-02 18:34:51 +03:00
tss->ts[0] = (struct timespec64) {0};
2017-08-23 00:08:48 +03:00
}
if (tss->ts[2].tv_sec || tss->ts[2].tv_nsec) {
2023-08-31 16:52:11 +03:00
if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_RAW_HARDWARE)
2017-08-23 00:08:48 +03:00
has_timestamping = true;
else
2019-02-02 18:34:51 +03:00
tss->ts[2] = (struct timespec64) {0};
2017-08-23 00:08:48 +03:00
}
if (has_timestamping) {
2019-02-02 18:34:51 +03:00
tss->ts[1] = (struct timespec64) {0};
if (sock_flag(sk, SOCK_TSTAMP_NEW))
put_cmsg_scm_timestamping64(msg, tss);
else
put_cmsg_scm_timestamping(msg, tss);
2017-08-23 00:08:48 +03:00
}
}
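The nested branches in tcp_recv_timestamp() above amount to a small decision table for the software timestamp. The sketch below restates that table in standalone C; the `enum ts_cmsg` values and the helpers `pick_ts_cmsg()` and `nsec_to_usec()` are hypothetical names used for illustration and do not correspond to kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the four control-message flavours. */
enum ts_cmsg {
	TS_NONE,		/* receive timestamps not requested */
	TS_TIMEVAL_OLD,		/* microsecond resolution, old layout */
	TS_TIMEVAL_NEW,		/* microsecond resolution, new layout */
	TS_TIMESPEC_OLD,	/* nanosecond resolution, old layout */
	TS_TIMESPEC_NEW,	/* nanosecond resolution, new layout */
};

/* Mirror of the branch structure: resolution first, then old vs new layout. */
static enum ts_cmsg pick_ts_cmsg(bool rcvtstamp, bool rcvtstampns,
				 bool tstamp_new)
{
	if (!rcvtstamp)
		return TS_NONE;
	if (rcvtstampns)
		return tstamp_new ? TS_TIMESPEC_NEW : TS_TIMESPEC_OLD;
	return tstamp_new ? TS_TIMEVAL_NEW : TS_TIMEVAL_OLD;
}

/* The timeval variants truncate nanoseconds to microseconds. */
static long nsec_to_usec(long nsec)
{
	return nsec / 1000;
}
```

Note that the conversion truncates rather than rounds, so a timestamp with 1999 ns in its sub-second part is reported as 1 µs.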
tcp: send in-queue bytes in cmsg upon read
Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.
The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.
Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.
Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.
With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
V3 change-log:
As suggested by David Miller, added loads with barrier
to check whether we have multiple threads calling recvmsg
in parallel. When that happens we lock the socket to
calculate inq.
V4 change-log:
Removed inline from a static function.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
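On the receiving side, an application that has enabled TCP_INQ could pull the hint out of the control messages returned by recvmsg() roughly as follows. This is a sketch under stated assumptions: the helper name `inq_from_cmsgs()` is hypothetical, the payload is assumed to be a single int, and the fallback definitions are only used when the libc headers do not already provide TCP_INQ and TCP_CM_INQ.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef TCP_INQ
#define TCP_INQ 36		/* value from the Linux UAPI headers */
#endif
#ifndef TCP_CM_INQ
#define TCP_CM_INQ TCP_INQ
#endif

/*
 * Walk the control messages attached to msg and return the in-queue
 * hint, or -1 if no TCP_CM_INQ message is present.
 */
static int inq_from_cmsgs(struct msghdr *msg)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == IPPROTO_TCP &&
		    cm->cmsg_type == TCP_CM_INQ) {
			int inq;

			/* Copy out to avoid unaligned access into the buffer. */
			memcpy(&inq, CMSG_DATA(cm), sizeof(inq));
			return inq;
		}
	}
	return -1;
}
```

Because the value is only a hint, a caller would use it to size the next read buffer rather than as an exact byte count.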
2018-05-01 22:39:15 +03:00
static int tcp_inq_hint(struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
u32 copied_seq = READ_ONCE(tp->copied_seq);
u32 rcv_nxt = READ_ONCE(tp->rcv_nxt);
int inq;
inq = rcv_nxt - copied_seq;
if (unlikely(inq < 0 || copied_seq != READ_ONCE(tp->copied_seq))) {
lock_sock(sk);
inq = tp->rcv_nxt - tp->copied_seq;
release_sock(sk);
}
2019-03-06 21:01:36 +03:00
/* After receiving a FIN, tell the user-space to continue reading
* by returning a non-zero inq.
*/
if (inq == 0 && sock_flag(sk, SOCK_DONE))
inq = 1;
2018-05-01 22:39:15 +03:00
return inq ;
}
2005-04-17 02:20:36 +04:00
/*
* This routine copies from a sock struct into the user buffer .
*
* Technical note : in 2.3 we work on _locked_ socket , so that
* tricks with * seq access order and skb - > users are not required .
* Probably , code can be easily improved even more .
*/
2020-12-03 01:53:43 +03:00
static int tcp_recvmsg_locked ( struct sock * sk , struct msghdr * msg , size_t len ,
net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().
Analogous to the referenced patch for skb_recv_datagram(), the 'flags' and
'noblock' parameters are unnecessarily split up, e.g. with
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags everywhere and checking for MSG_DONTWAIT
where needed (to preserve the formerly separate no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-11 15:49:55 +03:00
int flags , struct scm_timestamping_internal * tss ,
2020-12-03 01:53:43 +03:00
int * cmsg_flags )
2005-04-17 02:20:36 +04:00
{
struct tcp_sock * tp = tcp_sk ( sk ) ;
int copied = 0 ;
u32 peek_seq ;
u32 * seq ;
unsigned long used ;
2020-12-03 01:53:43 +03:00
int err ;
2005-04-17 02:20:36 +04:00
int target ; /* Read at least this many bytes */
long timeo ;
2015-07-24 19:19:25 +03:00
struct sk_buff * skb , * last ;
2009-05-11 00:32:34 +04:00
u32 urg_hole = 0 ;
2005-04-17 02:20:36 +04:00
err = - ENOTCONN ;
if ( sk - > sk_state = = TCP_LISTEN )
goto out ;
2022-04-29 03:45:06 +03:00
if ( tp - > recvmsg_inq ) {
2021-01-21 03:41:47 +03:00
* cmsg_flags = TCP_CMSG_INQ ;
2022-04-29 03:45:06 +03:00
msg - > msg_get_inq = 1 ;
}
2022-04-11 15:49:55 +03:00
timeo = sock_rcvtimeo ( sk , flags & MSG_DONTWAIT ) ;
2005-04-17 02:20:36 +04:00
/* Urgent data needs to be handled specially. */
if ( flags & MSG_OOB )
goto recv_urg ;
2012-04-19 07:41:01 +04:00
if ( unlikely ( tp - > repair ) ) {
err = - EPERM ;
if ( ! ( flags & MSG_PEEK ) )
goto out ;
if ( tp - > repair_queue = = TCP_SEND_QUEUE )
goto recv_sndq ;
err = - EINVAL ;
if ( tp - > repair_queue = = TCP_NO_QUEUE )
goto out ;
/* 'common' recv queue MSG_PEEK-ing */
}
2005-04-17 02:20:36 +04:00
seq = & tp - > copied_seq ;
if ( flags & MSG_PEEK ) {
peek_seq = tp - > copied_seq ;
seq = & peek_seq ;
}
target = sock_rcvlowat ( sk , flags & MSG_WAITALL , len ) ;
do {
u32 offset ;
/* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */
2021-11-15 22:02:44 +03:00
if ( unlikely ( tp - > urg_data ) & & tp - > urg_seq = = * seq ) {
2005-04-17 02:20:36 +04:00
if ( copied )
break ;
if ( signal_pending ( current ) ) {
copied = timeo ? sock_intr_errno ( timeo ) : - EAGAIN ;
break ;
}
}
/* Next get a buffer. */
2015-07-24 19:19:25 +03:00
last = skb_peek_tail ( & sk - > sk_receive_queue ) ;
2009-05-29 08:35:47 +04:00
skb_queue_walk ( & sk - > sk_receive_queue , skb ) {
2015-07-24 19:19:25 +03:00
last = skb ;
2005-04-17 02:20:36 +04:00
/* Now that we have two receive queues this
* shouldn ' t happen .
*/
2009-11-14 00:56:33 +03:00
if ( WARN ( before ( * seq , TCP_SKB_CB ( skb ) - > seq ) ,
2018-07-18 04:27:45 +03:00
" TCP recvmsg seq # bug: copied %X, seq %X, rcvnxt %X, fl %X \n " ,
2010-10-30 15:08:53 +04:00
* seq , TCP_SKB_CB ( skb ) - > seq , tp - > rcv_nxt ,
flags ) )
2005-04-17 02:20:36 +04:00
break ;
2009-11-14 00:56:33 +03:00
2005-04-17 02:20:36 +04:00
offset = * seq - TCP_SKB_CB ( skb ) - > seq ;
2016-02-02 08:03:08 +03:00
if ( unlikely ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_SYN ) ) {
pr_err_once ( " %s: found a SYN, please report ! \n " , __func__ ) ;
2005-04-17 02:20:36 +04:00
offset - - ;
2016-02-02 08:03:08 +03:00
}
2005-04-17 02:20:36 +04:00
if ( offset < skb - > len )
goto found_ok_skb ;
2014-09-15 15:19:51 +04:00
if ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_FIN )
2005-04-17 02:20:36 +04:00
goto found_fin_ok ;
2010-10-30 15:08:53 +04:00
WARN ( ! ( flags & MSG_PEEK ) ,
2018-07-18 04:27:45 +03:00
" TCP recvmsg seq # bug 2: copied %X, seq %X, rcvnxt %X, fl %X \n " ,
2010-10-30 15:08:53 +04:00
* seq , TCP_SKB_CB ( skb ) - > seq , tp - > rcv_nxt , flags ) ;
2009-05-29 08:35:47 +04:00
}
2005-04-17 02:20:36 +04:00
/* Well, if we have backlog, try to process it now. */
2019-11-06 21:04:11 +03:00
if ( copied > = target & & ! READ_ONCE ( sk - > sk_backlog . tail ) )
2005-04-17 02:20:36 +04:00
break ;
if ( copied ) {
2021-11-15 22:02:47 +03:00
if ( ! timeo | |
sk - > sk_err | |
2005-04-17 02:20:36 +04:00
sk - > sk_state = = TCP_CLOSE | |
( sk - > sk_shutdown & RCV_SHUTDOWN ) | |
2008-11-05 14:36:01 +03:00
signal_pending ( current ) )
2005-04-17 02:20:36 +04:00
break ;
} else {
if ( sock_flag ( sk , SOCK_DONE ) )
break ;
if ( sk - > sk_err ) {
copied = sock_error ( sk ) ;
break ;
}
if ( sk - > sk_shutdown & RCV_SHUTDOWN )
break ;
if ( sk - > sk_state = = TCP_CLOSE ) {
2018-07-08 09:15:56 +03:00
/* This occurs when the user tries to read
* from a never-connected socket .
*/
copied = - ENOTCONN ;
2005-04-17 02:20:36 +04:00
break ;
}
if ( ! timeo ) {
copied = - EAGAIN ;
break ;
}
if ( signal_pending ( current ) ) {
copied = sock_intr_errno ( timeo ) ;
break ;
}
}
if ( copied > = target ) {
/* Do not sleep, just process backlog. */
2021-11-15 22:02:40 +03:00
__sk_flush_backlog ( sk ) ;
2015-07-24 19:19:25 +03:00
} else {
2021-11-15 22:02:48 +03:00
tcp_cleanup_rbuf ( sk , copied ) ;
2023-10-11 10:20:55 +03:00
err = sk_wait_data ( sk , & timeo , last ) ;
if ( err < 0 ) {
err = copied ? : err ;
goto out ;
}
2015-07-24 19:19:25 +03:00
}
2005-04-17 02:20:36 +04:00
2009-05-11 00:32:34 +04:00
if ( ( flags & MSG_PEEK ) & &
( peek_seq - copied - urg_hole ! = tp - > copied_seq ) ) {
2012-05-14 01:56:26 +04:00
net_dbg_ratelimited ( " TCP(%s:%d): Application bug, race in MSG_PEEK \n " ,
current - > comm ,
task_pid_nr ( current ) ) ;
2005-04-17 02:20:36 +04:00
peek_seq = tp - > copied_seq ;
}
continue ;
2018-12-06 15:45:28 +03:00
found_ok_skb :
2005-04-17 02:20:36 +04:00
/* Ok so how much can we use? */
used = skb - > len - offset ;
if ( len < used )
used = len ;
/* Do we have urgent data here? */
2021-11-15 22:02:44 +03:00
if ( unlikely ( tp - > urg_data ) ) {
2005-04-17 02:20:36 +04:00
u32 urg_offset = tp - > urg_seq - * seq ;
if ( urg_offset < used ) {
if ( ! urg_offset ) {
if ( ! sock_flag ( sk , SOCK_URGINLINE ) ) {
2019-10-11 06:17:40 +03:00
WRITE_ONCE ( * seq , * seq + 1 ) ;
2009-05-11 00:32:34 +04:00
urg_hole + + ;
2005-04-17 02:20:36 +04:00
offset + + ;
used - - ;
if ( ! used )
goto skip_copy ;
}
} else
used = urg_offset ;
}
}
if ( ! ( flags & MSG_TRUNC ) ) {
2014-11-06 00:46:40 +03:00
err = skb_copy_datagram_msg ( skb , offset , msg , used ) ;
2013-12-31 00:37:29 +04:00
if ( err ) {
/* Exception. Bailout! */
if ( ! copied )
copied = - EFAULT ;
break ;
2005-04-17 02:20:36 +04:00
}
}
2019-10-11 06:17:40 +03:00
WRITE_ONCE ( * seq , * seq + used ) ;
2005-04-17 02:20:36 +04:00
copied + = used ;
len - = used ;
tcp_rcv_space_adjust ( sk ) ;
skip_copy :
2021-11-15 22:02:44 +03:00
if ( unlikely ( tp - > urg_data ) & & after ( tp - > copied_seq , tp - > urg_seq ) ) {
2021-11-15 22:02:43 +03:00
WRITE_ONCE ( tp - > urg_data , 0 ) ;
2017-08-30 20:24:58 +03:00
tcp_fast_path_check ( sk ) ;
}
2005-04-17 02:20:36 +04:00
2017-08-23 00:08:48 +03:00
if ( TCP_SKB_CB ( skb ) - > has_rxtstamp ) {
2020-12-03 01:53:43 +03:00
tcp_update_recv_tstamps ( skb , tss ) ;
2021-01-21 03:41:47 +03:00
* cmsg_flags | = TCP_CMSG_TS ;
2017-08-23 00:08:48 +03:00
}
2020-05-08 22:58:46 +03:00
if ( used + offset < skb - > len )
continue ;
2014-09-15 15:19:51 +04:00
if ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_FIN )
2005-04-17 02:20:36 +04:00
goto found_fin_ok ;
2013-12-31 00:37:29 +04:00
if ( ! ( flags & MSG_PEEK ) )
2021-11-15 22:02:45 +03:00
tcp_eat_recv_skb ( sk , skb ) ;
2005-04-17 02:20:36 +04:00
continue ;
2018-12-06 15:45:28 +03:00
found_fin_ok :
2005-04-17 02:20:36 +04:00
/* Process the FIN. */
2019-10-11 06:17:40 +03:00
WRITE_ONCE ( * seq , * seq + 1 ) ;
2013-12-31 00:37:29 +04:00
if ( ! ( flags & MSG_PEEK ) )
2021-11-15 22:02:45 +03:00
tcp_eat_recv_skb ( sk , skb ) ;
2005-04-17 02:20:36 +04:00
break ;
} while ( len > 0 ) ;
/* According to UNIX98, msg_name/msg_namelen are ignored
* on connected socket . I was just happy when found this 8 ) - - ANK
*/
/* Clean up data we have read: This will do ACK frames. */
2006-05-24 05:00:16 +04:00
tcp_cleanup_rbuf ( sk , copied ) ;
2005-04-17 02:20:36 +04:00
return copied ;
out :
return err ;
recv_urg :
2009-04-01 01:43:17 +04:00
err = tcp_recv_urg ( sk , msg , len , flags ) ;
2005-04-17 02:20:36 +04:00
goto out ;
2012-04-19 07:41:01 +04:00
recv_sndq :
err = tcp_peek_sndq ( sk , msg , len ) ;
goto out ;
2005-04-17 02:20:36 +04:00
}
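The receive loop above leans on two pieces of 32-bit sequence-space
arithmetic: the before()/after() comparisons and the unsigned
subtraction that yields the offset of the read cursor into an skb. A
minimal stand-alone sketch follows; the names seq_before, seq_after,
seq_offset and seq_demo are hypothetical stand-ins chosen for this
illustration, not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when seq1 precedes seq2 in sequence space, even across wraparound:
 * the unsigned difference is reinterpreted as a signed 32-bit distance.
 */
static inline bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* seq1 follows seq2 exactly when seq2 precedes seq1. */
static inline bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return seq_before(seq2, seq1);
}

/* Bytes between the start of a segment and the read cursor, modulo 2^32. */
static inline uint32_t seq_offset(uint32_t cursor, uint32_t seg_start)
{
	return cursor - seg_start;
}

/* A segment starting 16 bytes before the wrap point, cursor 16 bytes after. */
void seq_demo(void)
{
	uint32_t seg_start = 0xFFFFFFF0u;
	uint32_t cursor = 0x00000010u;

	assert(seq_before(seg_start, cursor));   /* ordering survives the wrap */
	assert(seq_after(cursor, seg_start));
	assert(!seq_before(cursor, seg_start));
	assert(seq_offset(cursor, seg_start) == 32u);
}
```

A plain `<` comparison would get the ordering in seq_demo() backwards,
which is why the loop never compares raw sequence numbers directly.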
2020-12-03 01:53:43 +03:00
2022-04-11 15:49:55 +03:00
int tcp_recvmsg ( struct sock * sk , struct msghdr * msg , size_t len , int flags ,
int * addr_len )
2020-12-03 01:53:43 +03:00
{
2022-04-29 03:45:06 +03:00
int cmsg_flags = 0 , ret ;
2020-12-03 01:53:43 +03:00
struct scm_timestamping_internal tss ;
if ( unlikely ( flags & MSG_ERRQUEUE ) )
return inet_recv_error ( sk , msg , len , addr_len ) ;
if ( sk_can_busy_loop ( sk ) & &
skb_queue_empty_lockless ( & sk - > sk_receive_queue ) & &
sk - > sk_state = = TCP_ESTABLISHED )
2022-04-11 15:49:55 +03:00
sk_busy_loop ( sk , flags & MSG_DONTWAIT ) ;
2020-12-03 01:53:43 +03:00
lock_sock ( sk ) ;
2022-04-11 15:49:55 +03:00
ret = tcp_recvmsg_locked ( sk , msg , len , flags , & tss , & cmsg_flags ) ;
2020-12-03 01:53:43 +03:00
release_sock ( sk ) ;
2022-04-29 03:45:06 +03:00
if ( ( cmsg_flags | | msg - > msg_get_inq ) & & ret > = 0 ) {
2021-01-21 03:41:47 +03:00
if ( cmsg_flags & TCP_CMSG_TS )
2020-12-03 01:53:43 +03:00
tcp_recv_timestamp ( msg , sk , & tss ) ;
2022-04-29 03:45:06 +03:00
if ( msg - > msg_get_inq ) {
msg - > msg_inq = tcp_inq_hint ( sk ) ;
if ( cmsg_flags & TCP_CMSG_INQ )
put_cmsg ( msg , SOL_TCP , TCP_CM_INQ ,
sizeof ( msg - > msg_inq ) , & msg - > msg_inq ) ;
2020-12-03 01:53:43 +03:00
}
}
return ret ;
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( tcp_recvmsg ) ;
2005-04-17 02:20:36 +04:00
[TCP]: Uninline tcp_set_state
net/ipv4/tcp.c:
tcp_close_state | -226
tcp_done | -145
tcp_close | -564
tcp_disconnect | -141
4 functions changed, 1076 bytes removed, diff: -1076
net/ipv4/tcp_input.c:
tcp_fin | -86
tcp_rcv_state_process | -164
2 functions changed, 250 bytes removed, diff: -250
net/ipv4/tcp_ipv4.c:
tcp_v4_connect | -209
1 function changed, 209 bytes removed, diff: -209
net/ipv4/arp.c:
arp_ignore | +5
1 function changed, 5 bytes added, diff: +5
net/ipv6/tcp_ipv6.c:
tcp_v6_connect | -158
1 function changed, 158 bytes removed, diff: -158
net/sunrpc/xprtsock.c:
xs_sendpages | -2
1 function changed, 2 bytes removed, diff: -2
net/dccp/ccids/ccid3.c:
ccid3_update_send_interval | +7
1 function changed, 7 bytes added, diff: +7
net/ipv4/tcp.c:
tcp_set_state | +238
1 function changed, 238 bytes added, diff: +238
built-in.o:
12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445
I have no explanation for why some unrelated changes seem to occur
consistently as well (arp_ignore, ccid3_update_send_interval;
I checked the arp_ignore asm and it seems to be due to some
reordering of operations causing some extra opcodes to be
generated). Still, the benefits are pretty obvious from the
codiff results.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-12 14:17:20 +03:00
void tcp_set_state ( struct sock * sk , int state )
{
int oldstate = sk - > sk_state ;
2018-01-26 03:14:15 +03:00
/* We defined a new enum for TCP states that are exported in BPF
* so as not to force the internal TCP states to be frozen . The
* following checks will detect if an internal state value ever
* differs from the BPF value . If this ever happens , then we will
* need to remap the internal value to the BPF value before calling
* tcp_call_bpf_2arg .
*/
BUILD_BUG_ON ( ( int ) BPF_TCP_ESTABLISHED ! = ( int ) TCP_ESTABLISHED ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_SYN_SENT ! = ( int ) TCP_SYN_SENT ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_SYN_RECV ! = ( int ) TCP_SYN_RECV ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_FIN_WAIT1 ! = ( int ) TCP_FIN_WAIT1 ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_FIN_WAIT2 ! = ( int ) TCP_FIN_WAIT2 ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_TIME_WAIT ! = ( int ) TCP_TIME_WAIT ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_CLOSE ! = ( int ) TCP_CLOSE ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_CLOSE_WAIT ! = ( int ) TCP_CLOSE_WAIT ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_LAST_ACK ! = ( int ) TCP_LAST_ACK ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_LISTEN ! = ( int ) TCP_LISTEN ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_CLOSING ! = ( int ) TCP_CLOSING ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_NEW_SYN_RECV ! = ( int ) TCP_NEW_SYN_RECV ) ;
BUILD_BUG_ON ( ( int ) BPF_TCP_MAX_STATES ! = ( int ) TCP_MAX_STATES ) ;
bpf: net: Emit anonymous enum with BPF_TCP_CLOSE value explicitly
The selftest failed to compile with clang-built bpf-next.
Adding LLVM=1 to your vmlinux and selftest build will use clang.
The error message is:
progs/test_sk_storage_tracing.c:38:18: error: use of undeclared identifier 'BPF_TCP_CLOSE'
if (newstate == BPF_TCP_CLOSE)
^
1 error generated.
make: *** [Makefile:423: /bpf-next/tools/testing/selftests/bpf/test_sk_storage_tracing.o] Error 1
The reason for the failure is that BPF_TCP_CLOSE, a value of
an anonymous enum defined in uapi bpf.h, is not defined in
vmlinux.h. gcc does not have this problem. Since vmlinux.h
is derived from BTF which is derived from vmlinux DWARF,
that means gcc-produced vmlinux DWARF has BPF_TCP_CLOSE
while llvm-produced vmlinux DWARF does not.
BPF_TCP_CLOSE is referenced in net/ipv4/tcp.c as
BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
The following test mimics the above BUILD_BUG_ON, preprocessed
with clang compiler, and shows gcc DWARF contains BPF_TCP_CLOSE while
llvm DWARF does not.
$ cat t.c
enum {
BPF_TCP_ESTABLISHED = 1,
BPF_TCP_CLOSE = 7,
};
enum {
TCP_ESTABLISHED = 1,
TCP_CLOSE = 7,
};
int test() {
do {
extern void __compiletime_assert_767(void) ;
if ((int)BPF_TCP_CLOSE != (int)TCP_CLOSE) __compiletime_assert_767();
} while (0);
return 0;
}
$ clang t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
$ gcc t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
DW_AT_name ("BPF_TCP_CLOSE")
Further checking of the clang code finds that clang actually tries to
evaluate the condition at compile time. If it is definitely
true/false, it will perform optimization and the whole if condition
will be removed before generating IR/debuginfo.
This patch explicitly adds an expression after the
above-mentioned BUILD_BUG_ON in net/ipv4/tcp.c like
(void)BPF_TCP_ESTABLISHED
to enable generation of debuginfo for the anonymous
enum which also includes BPF_TCP_CLOSE.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210317174132.589276-1-yhs@fb.com
2021-03-17 20:41:32 +03:00
/* bpf uapi header bpf.h defines an anonymous enum with values
* BPF_TCP_ * used by bpf programs . Currently gcc built vmlinux
* is able to emit this enum in DWARF due to the above BUILD_BUG_ON .
* But clang built vmlinux does not have this enum in DWARF
* since clang removes the above code before generating IR / debuginfo .
* Let us explicitly emit the type debuginfo to ensure the
* above - mentioned anonymous enum in the vmlinux DWARF and hence BTF
* regardless of which compiler is used .
*/
BTF_TYPE_EMIT_ENUM ( BPF_TCP_ESTABLISHED ) ;
2018-01-26 03:14:15 +03:00
if ( BPF_SOCK_OPS_TEST_FLAG ( tcp_sk ( sk ) , BPF_SOCK_OPS_STATE_CB_FLAG ) )
tcp_call_bpf_2arg ( sk , BPF_SOCK_OPS_STATE_CB , oldstate , state ) ;
2017-10-23 19:20:27 +03:00
2008-01-12 14:17:20 +03:00
switch ( state ) {
case TCP_ESTABLISHED :
if ( oldstate ! = TCP_ESTABLISHED )
2008-07-17 07:22:04 +04:00
TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_CURRESTAB ) ;
2008-01-12 14:17:20 +03:00
break ;
case TCP_CLOSE :
if ( oldstate = = TCP_CLOSE_WAIT | | oldstate = = TCP_ESTABLISHED )
2008-07-17 07:22:04 +04:00
TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_ESTABRESETS ) ;
2008-01-12 14:17:20 +03:00
sk - > sk_prot - > unhash ( sk ) ;
if ( inet_csk ( sk ) - > icsk_bind_hash & &
! ( sk - > sk_userlocks & SOCK_BINDPORT_LOCK ) )
[SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 needs
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_unhash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 15:06:04 +03:00
inet_put_port ( sk ) ;
2020-03-13 01:50:22 +03:00
fallthrough ;
2008-01-12 14:17:20 +03:00
default :
2008-11-03 11:24:34 +03:00
if ( oldstate = = TCP_ESTABLISHED )
2008-07-17 07:22:46 +04:00
TCP_DEC_STATS ( sock_net ( sk ) , TCP_MIB_CURRESTAB ) ;
2008-01-12 14:17:20 +03:00
}
/* Change state AFTER socket is unhashed to avoid closed
* socket sitting in hash tables .
*/
2017-12-20 06:12:51 +03:00
inet_sk_state_store ( sk , state ) ;
2008-01-12 14:17:20 +03:00
}
EXPORT_SYMBOL_GPL ( tcp_set_state ) ;
2005-04-17 02:20:36 +04:00
/*
* State processing on a close . This implements the state shift for
* sending our FIN frame . Note that we only send a FIN for some
* states . A shutdown ( ) may have already sent the FIN , or we may be
* closed .
*/
2005-11-30 03:21:38 +03:00
static const unsigned char new_state [ 16 ] = {
2005-04-17 02:20:36 +04:00
/* current state: new state: action: */
2015-03-25 01:58:53 +03:00
[ 0 /* (Invalid) */ ] = TCP_CLOSE ,
[ TCP_ESTABLISHED ] = TCP_FIN_WAIT1 | TCP_ACTION_FIN ,
[ TCP_SYN_SENT ] = TCP_CLOSE ,
[ TCP_SYN_RECV ] = TCP_FIN_WAIT1 | TCP_ACTION_FIN ,
[ TCP_FIN_WAIT1 ] = TCP_FIN_WAIT1 ,
[ TCP_FIN_WAIT2 ] = TCP_FIN_WAIT2 ,
[ TCP_TIME_WAIT ] = TCP_CLOSE ,
[ TCP_CLOSE ] = TCP_CLOSE ,
[ TCP_CLOSE_WAIT ] = TCP_LAST_ACK | TCP_ACTION_FIN ,
[ TCP_LAST_ACK ] = TCP_LAST_ACK ,
[ TCP_LISTEN ] = TCP_CLOSE ,
[ TCP_CLOSING ] = TCP_CLOSING ,
[ TCP_NEW_SYN_RECV ] = TCP_CLOSE , /* should not happen ! */
2005-04-17 02:20:36 +04:00
} ;
static int tcp_close_state ( struct sock * sk )
{
int next = ( int ) new_state [ sk - > sk_state ] ;
int ns = next & TCP_STATE_MASK ;
tcp_set_state ( sk , ns ) ;
return next & TCP_ACTION_FIN ;
}
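The new_state table above packs two things into one byte: the low bits hold the next state and a high bit flags that a FIN has to be sent. The following is a minimal user-space sketch of that packing, not the kernel code itself. All `demo_` names are hypothetical, and the state numbers, the mask (0x0F) and the action bit (0x80) are assumptions chosen to mirror the kernel's definitions, which are not shown in this file.

```c
#include <stdbool.h>

/* Illustrative state numbers; assumed to mirror the kernel's enum. */
enum demo_state {
	DEMO_ESTABLISHED = 1,
	DEMO_SYN_SENT,
	DEMO_SYN_RECV,
	DEMO_FIN_WAIT1,
	DEMO_FIN_WAIT2,
	DEMO_TIME_WAIT,
	DEMO_CLOSE,
	DEMO_CLOSE_WAIT,
	DEMO_LAST_ACK,
	DEMO_LISTEN,
	DEMO_CLOSING,
	DEMO_NEW_SYN_RECV,
};

#define DEMO_STATE_MASK 0x0F	/* low nibble: next state (assumed value) */
#define DEMO_ACTION_FIN 0x80	/* high bit: "send a FIN" (assumed value) */

/* Indexed by current state; each entry is next state plus optional action. */
static const unsigned char demo_new_state[16] = {
	[0]                 = DEMO_CLOSE,
	[DEMO_ESTABLISHED]  = DEMO_FIN_WAIT1 | DEMO_ACTION_FIN,
	[DEMO_SYN_SENT]     = DEMO_CLOSE,
	[DEMO_SYN_RECV]     = DEMO_FIN_WAIT1 | DEMO_ACTION_FIN,
	[DEMO_FIN_WAIT1]    = DEMO_FIN_WAIT1,
	[DEMO_FIN_WAIT2]    = DEMO_FIN_WAIT2,
	[DEMO_TIME_WAIT]    = DEMO_CLOSE,
	[DEMO_CLOSE]        = DEMO_CLOSE,
	[DEMO_CLOSE_WAIT]   = DEMO_LAST_ACK | DEMO_ACTION_FIN,
	[DEMO_LAST_ACK]     = DEMO_LAST_ACK,
	[DEMO_LISTEN]       = DEMO_CLOSE,
	[DEMO_CLOSING]      = DEMO_CLOSING,
	[DEMO_NEW_SYN_RECV] = DEMO_CLOSE,
};

struct demo_close_result {
	int next_state;	/* state to move to */
	bool send_fin;	/* true when the caller should emit a FIN */
};

/* Unpack one table entry into its state and action parts. */
static struct demo_close_result demo_close_state(int cur)
{
	unsigned char packed = demo_new_state[cur & DEMO_STATE_MASK];
	struct demo_close_result r = {
		.next_state = packed & DEMO_STATE_MASK,
		.send_fin = (packed & DEMO_ACTION_FIN) != 0,
	};

	return r;
}
```

Because the action bit lies outside the state mask, a single table lookup yields both the transition and the decision whether to send a FIN.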
/*
* Shutdown the sending side of a connection . Much like close except
2008-04-21 13:27:58 +04:00
* that we don ' t receive shut down or sock_set_flag ( sk , SOCK_DEAD ) .
2005-04-17 02:20:36 +04:00
*/
void tcp_shutdown ( struct sock * sk , int how )
{
/* We need to grab some memory, and put together a FIN,
* and then put it into the queue to be sent .
* Tim MacKenzie ( tym @ dibbler . cs . monash . edu . au ) 4 Dec ' 92.
*/
if ( ! ( how & SEND_SHUTDOWN ) )
return ;
/* If we've already sent a FIN, or it's a closed state, skip this. */
if ( ( 1 < < sk - > sk_state ) &
( TCPF_ESTABLISHED | TCPF_SYN_SENT |
TCPF_SYN_RECV | TCPF_CLOSE_WAIT ) ) {
/* Clear out any half completed packets. FIN if needed. */
if ( tcp_close_state ( sk ) )
tcp_send_fin ( sk ) ;
}
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( tcp_shutdown ) ;
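The state check in tcp_shutdown() tests membership in a set of states with a single AND: the current state is turned into a one-hot bit and compared against an OR of per-state flags. A small user-space sketch of that idiom follows. The `SK_` names and the `sk_state_in_fin_set()` helper are hypothetical, and the state numbers are assumed to match the kernel's.

```c
#include <stdbool.h>

/* Hypothetical state numbers; assumed to mirror the kernel's values. */
enum {
	SK_ESTABLISHED = 1,
	SK_SYN_SENT    = 2,
	SK_SYN_RECV    = 3,
	SK_FIN_WAIT1   = 4,
	SK_CLOSE       = 7,
	SK_CLOSE_WAIT  = 8,
};

/* One bit per state, in the spirit of the TCPF_* flags. */
#define SK_FLAG(state) (1u << (state))

/*
 * True when the state belongs to the set for which a send-side shutdown
 * still runs the close-state transition (which may or may not emit a FIN).
 */
static bool sk_state_in_fin_set(int state)
{
	const unsigned int fin_set =
		SK_FLAG(SK_ESTABLISHED) | SK_FLAG(SK_SYN_SENT) |
		SK_FLAG(SK_SYN_RECV) | SK_FLAG(SK_CLOSE_WAIT);

	return (SK_FLAG(state) & fin_set) != 0;
}
```

States outside the set, such as FIN_WAIT1 or CLOSE, either already sent their FIN or have nothing left to close, so the check falls through without side effects.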
2005-04-17 02:20:36 +04:00
2021-10-14 16:41:26 +03:00
int tcp_orphan_count_sum ( void )
{
int i , total = 0 ;
for_each_possible_cpu ( i )
total + = per_cpu ( tcp_orphan_count , i ) ;
return max ( total , 0 ) ;
}
static int tcp_orphan_cache ;
static struct timer_list tcp_orphan_timer ;
# define TCP_ORPHAN_TIMER_PERIOD msecs_to_jiffies(100)
static void tcp_orphan_update ( struct timer_list * unused )
{
WRITE_ONCE ( tcp_orphan_cache , tcp_orphan_count_sum ( ) ) ;
mod_timer ( & tcp_orphan_timer , jiffies + TCP_ORPHAN_TIMER_PERIOD ) ;
}
static bool tcp_too_many_orphans ( int shift )
{
2022-07-07 02:39:58 +03:00
return READ_ONCE ( tcp_orphan_cache ) < < shift >
READ_ONCE ( sysctl_tcp_max_orphans ) ;
2021-10-14 16:41:26 +03:00
}
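The orphan accounting above has two steps: per-CPU counters are summed and clamped at zero (a single CPU's counter can go negative when a socket is orphaned on one CPU and destroyed on another), and the cached sum is then shifted left before being compared with the limit. A user-space sketch of both steps, under the assumption that plain arrays stand in for per-CPU variables. The `demo_` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sum per-CPU deltas; individual entries may be negative, the total is clamped. */
static int demo_orphan_sum(const int *per_cpu, size_t ncpus)
{
	int total = 0;

	for (size_t i = 0; i < ncpus; i++)
		total += per_cpu[i];

	return total > 0 ? total : 0;
}

/*
 * Compare the (cached) orphan count against the limit. A shift of 1
 * doubles the count, so the check already trips at half the limit.
 */
static bool demo_too_many_orphans(int cached_orphans, int shift, int max_orphans)
{
	return (cached_orphans << shift) > max_orphans;
}
```

With a limit of 1000, a count of 600 passes at shift 0 but fails at shift 1, which lets callers apply a stricter threshold without a second sysctl.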
2012-01-31 02:16:06 +04:00
bool tcp_check_oom ( struct sock * sk , int shift )
{
bool too_many_orphans , out_of_socket_memory ;
2021-10-14 16:41:26 +03:00
too_many_orphans = tcp_too_many_orphans ( shift ) ;
2012-01-31 02:16:06 +04:00
out_of_socket_memory = tcp_out_of_memory ( sk ) ;
2012-05-14 01:56:26 +04:00
if ( too_many_orphans )
net_info_ratelimited ( " too many orphaned sockets \n " ) ;
if ( out_of_socket_memory )
net_info_ratelimited ( " out of memory -- consider tuning tcp_mem \n " ) ;
2012-01-31 02:16:06 +04:00
return too_many_orphans | | out_of_socket_memory ;
}
2020-11-16 12:48:04 +03:00
void __tcp_close ( struct sock * sk , long timeout )
2005-04-17 02:20:36 +04:00
{
struct sk_buff * skb ;
int data_was_unread = 0 ;
2006-05-04 10:31:35 +04:00
int state ;
2005-04-17 02:20:36 +04:00
2023-05-09 23:36:56 +03:00
WRITE_ONCE ( sk - > sk_shutdown , SHUTDOWN_MASK ) ;
2005-04-17 02:20:36 +04:00
if ( sk - > sk_state = = TCP_LISTEN ) {
tcp_set_state ( sk , TCP_CLOSE ) ;
/* Special case. */
2005-08-10 07:11:41 +04:00
inet_csk_listen_stop ( sk ) ;
2005-04-17 02:20:36 +04:00
goto adjudge_to_death ;
}
/* We need to flush the recv. buffs. We do this only on the
* descriptor close , not protocol - sourced closes , because the
* reader process may not have drained the data yet !
*/
while ( ( skb = __skb_dequeue ( & sk - > sk_receive_queue ) ) ! = NULL ) {
2014-09-15 15:19:51 +04:00
u32 len = TCP_SKB_CB ( skb ) - > end_seq - TCP_SKB_CB ( skb ) - > seq ;
if ( TCP_SKB_CB ( skb ) - > tcp_flags & TCPHDR_FIN )
len - - ;
2005-04-17 02:20:36 +04:00
data_was_unread + = len ;
__kfree_skb ( skb ) ;
}
tcp: do not send reset to already closed sockets
I've found that tcp_close() can be called for an already closed
socket, but still sends a reset in this case (tcp_send_active_reset()),
which seems to be incorrect. Moreover, the reset packet is sent
with a different source port, as the original port number has already
been cleared on the socket. Besides that, incrementing the stat counter
for LINUX_MIB_TCPABORTONCLOSE also does not look correct in this case.
Initially this issue was found on a 2.6.18-x RHEL5 kernel, but the same
seems to be true for the current mainstream kernel (checked on
2.6.35-rc3). Please correct me if I missed something.
How that happens:
1) the server receives a packet for socket in TCP_CLOSE_WAIT state
that triggers a tcp_reset():
Call Trace:
<IRQ> [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8
[<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08
[<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a
[<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43
[<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259
[<ffffffff80037131>] ip_local_deliver+0x200/0x2f4
[<ffffffff8003843c>] ip_rcv+0x64c/0x69f
[<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa
[<ffffffff80032eca>] process_backlog+0x90/0xec
[<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1
[<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce
[<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0
[<ffffffff8006334c>] call_softirq+0x1c/0x28
[<ffffffff80070476>] do_softirq+0x2c/0x85
[<ffffffff80070441>] do_IRQ+0x149/0x152
[<ffffffff80062665>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303
[<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303
[<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e
[<ffffffff8006a263>] do_page_fault+0x49a/0x7dc
[<ffffffff80066487>] thread_return+0x89/0x174
[<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c
[<ffffffff80062e39>] error_exit+0x0/0x84
tcp_rcv_state_process()
... // (sk_state == TCP_CLOSE_WAIT here)
...
/* step 2: check RST bit */
if(th->rst) {
tcp_reset(sk);
goto discard;
}
...
---------------------------------
tcp_rcv_state_process
tcp_reset
tcp_done
tcp_set_state(sk, TCP_CLOSE);
inet_put_port
__inet_put_port
inet_sk(sk)->num = 0;
sk->sk_shutdown = SHUTDOWN_MASK;
2) After that the process (socket owner) tries to write something to
that socket and "inet_autobind" sets a _new_ (which differs from
the original!) port number for the socket:
Call Trace:
[<ffffffff80255a12>] inet_bind_hash+0x33/0x5f
[<ffffffff80257180>] inet_csk_get_port+0x216/0x268
[<ffffffff8026bcc9>] inet_autobind+0x22/0x8f
[<ffffffff80049140>] inet_sendmsg+0x27/0x57
[<ffffffff8003a9d9>] do_sock_write+0xae/0xea
[<ffffffff80226ac7>] sock_writev+0xdc/0xf6
[<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
[<ffffffff8001fb49>] __pollwait+0x0/0xdd
[<ffffffff8008d533>] default_wake_function+0x0/0xe
[<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
[<ffffffff800f0b49>] do_readv_writev+0x163/0x274
[<ffffffff80066538>] thread_return+0x13a/0x174
[<ffffffff800145d8>] tcp_poll+0x0/0x1c9
[<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
[<ffffffff800f0dd0>] sys_writev+0x49/0xe4
[<ffffffff800622dd>] tracesys+0xd5/0xe0
3) sendmsg fails at last with -EPIPE (=> 'write' returns -EPIPE in userspace):
F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3
Call Trace:
[<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87
[<ffffffff80033300>] release_sock+0x10/0xae
[<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7
[<ffffffff8026bd32>] inet_autobind+0x8b/0x8f
[<ffffffff8003a9d9>] do_sock_write+0xae/0xea
[<ffffffff80226ac7>] sock_writev+0xdc/0xf6
[<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
[<ffffffff8001fb49>] __pollwait+0x0/0xdd
[<ffffffff8008d533>] default_wake_function+0x0/0xe
[<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
[<ffffffff800f0b49>] do_readv_writev+0x163/0x274
[<ffffffff80066538>] thread_return+0x13a/0x174
[<ffffffff800145d8>] tcp_poll+0x0/0x1c9
[<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
[<ffffffff800f0dd0>] sys_writev+0x49/0xe4
[<ffffffff800622dd>] tracesys+0xd5/0xe0
tcp_sendmsg()
...
/* Wait for a connection to finish. */
if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
int old_state = sk->sk_state;
if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) {
if (f_d && (err == -EPIPE)) {
printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, "
"sk_err=%d, sk_shutdown=%d\n",
sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state,
sk->sk_err, sk->sk_shutdown);
dump_stack();
}
goto out_err;
}
}
...
4) Then the process (socket owner) understands that it's time to close
that socket and does that (and thus triggers sending reset packet):
Call Trace:
...
[<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6
[<ffffffff80034698>] ip_output+0x351/0x384
[<ffffffff80251ae9>] dst_output+0x0/0xe
[<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2
[<ffffffff80095700>] vprintk+0x21/0x33
[<ffffffff800070f0>] check_poison_obj+0x2e/0x206
[<ffffffff80013587>] poison_obj+0x36/0x45
[<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
[<ffffffff80023481>] dbg_redzone1+0x1c/0x25
[<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
[<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8
[<ffffffff80023405>] tcp_transmit_skb+0x764/0x786
[<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d
[<ffffffff80258ff1>] tcp_close+0x39a/0x960
[<ffffffff8026be12>] inet_release+0x69/0x80
[<ffffffff80059b31>] sock_release+0x4f/0xcf
[<ffffffff80059d4c>] sock_close+0x2c/0x30
[<ffffffff800133c9>] __fput+0xac/0x197
[<ffffffff800252bc>] filp_close+0x59/0x61
[<ffffffff8001eff6>] sys_close+0x85/0xc7
[<ffffffff800622dd>] tracesys+0xd5/0xe0
So, in brief:
* a received packet for a socket in TCP_CLOSE_WAIT state triggers
tcp_reset(), which clears inet_sk(sk)->num and puts the socket into
TCP_CLOSE state
* an attempt to write to that socket forces inet_autobind() to get a
new port (but the write itself fails with -EPIPE)
* tcp_close() called for a socket in TCP_CLOSE state sends an active
reset via the socket with the newly allocated port
This adds an additional check in tcp_close() for already closed
sockets. We do not want to send anything to closed sockets.
Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 08:54:58 +04:00
/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
if ( sk - > sk_state = = TCP_CLOSE )
goto adjudge_to_death ;
2007-04-29 08:21:46 +04:00
/* As outlined in RFC 2525, section 2.17, we send a RST here because
* data was lost . To witness the awful effects of the old behavior of
* always doing a FIN , run an older 2.1 . x kernel or 2.0 . x , start a bulk
* GET in an FTP client , suspend the process , wait for the client to
* advertise a zero window , then kill - 9 the FTP client , wheee . . .
* Note : timeout is always zero in such a case .
2005-04-17 02:20:36 +04:00
*/
2012-04-19 07:40:39 +04:00
if ( unlikely ( tcp_sk ( sk ) - > repair ) ) {
sk - > sk_prot - > disconnect ( sk , 0 ) ;
} else if ( data_was_unread ) {
2005-04-17 02:20:36 +04:00
/* Unread data was tossed, zap the connection. */
net: snmp: kill various STATS_USER() helpers
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.
After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.
We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()
Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 02:44:27 +03:00
NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPABORTONCLOSE ) ;
2005-04-17 02:20:36 +04:00
tcp_set_state ( sk , TCP_CLOSE ) ;
2009-09-03 10:45:45 +04:00
tcp_send_active_reset ( sk , sk - > sk_allocation ) ;
2005-04-17 02:20:36 +04:00
} else if ( sock_flag ( sk , SOCK_LINGER ) & & ! sk - > sk_lingertime ) {
/* Check zero linger _after_ checking for unread data. */
sk - > sk_prot - > disconnect ( sk , 0 ) ;
2016-04-28 02:44:27 +03:00
NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPABORTONDATA ) ;
2005-04-17 02:20:36 +04:00
} else if ( tcp_close_state ( sk ) ) {
/* We FIN if the application ate all the data before
* zapping the connection .
*/
/* RED-PEN. Formally speaking, we have broken TCP state
* machine . State transitions :
*
* TCP_ESTABLISHED - > TCP_FIN_WAIT1
* TCP_SYN_RECV - > TCP_FIN_WAIT1 ( forget it , it ' s impossible )
* TCP_CLOSE_WAIT - > TCP_LAST_ACK
*
* are legal only when FIN has been sent ( i . e . in window ) ,
* rather than queued out of window . Purists blame .
*
* F . e . " RFC state " is ESTABLISHED ,
* if Linux state is FIN - WAIT - 1 , but FIN is still not sent .
*
* The visible deviations are that sometimes
* we enter time - wait state , when it is not required really
* ( harmless ) , do not send active resets , when they are
* required by specs ( TCP_ESTABLISHED , TCP_CLOSE_WAIT , when
* they look as CLOSING or LAST_ACK for Linux )
* Probably , I missed some more holelets .
* - - ANK
2012-08-31 16:29:12 +04:00
* XXX ( TFO ) - To start off we don ' t support SYN + ACK + FIN
* in a single packet ! ( May consider it later but will
* probably need API support or TCP_CORK SYN - ACK until
* data is written and socket is closed . )
2005-04-17 02:20:36 +04:00
*/
tcp_send_fin ( sk ) ;
}
sk_stream_wait_close ( sk , timeout ) ;
adjudge_to_death :
2006-05-04 10:31:35 +04:00
state = sk - > sk_state ;
sock_hold ( sk ) ;
sock_orphan ( sk ) ;
2005-04-17 02:20:36 +04:00
local_bh_disable ( ) ;
bh_lock_sock ( sk ) ;
2018-10-02 09:24:26 +03:00
/* remove backlog if any, without releasing ownership. */
__release_sock ( sk ) ;
2005-04-17 02:20:36 +04:00
2021-10-14 16:41:26 +03:00
this_cpu_inc ( tcp_orphan_count ) ;
2008-12-30 10:04:08 +03:00
2006-05-04 10:31:35 +04:00
/* Have we already been destroyed by a softirq or backlog? */
if ( state ! = TCP_CLOSE & & sk - > sk_state = = TCP_CLOSE )
goto out ;
2005-04-17 02:20:36 +04:00
/* This is a (useful) BSD violation of the RFC. There is a
* problem with TCP as specified in that the other end could
* keep a socket open forever with no application left at this end .
2014-02-10 02:30:32 +04:00
* We use a 1 minute timeout ( about the same as BSD ) then kill
2005-04-17 02:20:36 +04:00
* our end . If they send after that then tough - BUT : long enough
* that we won ' t make the old 4 * rto = almost no time - whoops
* reset mistake .
*
* Nope , it was not a mistake . It is really desired behaviour
* f . e . on http servers , when such sockets are useless , but
* consume significant resources . Let ' s do it with special
* linger2 option . - - ANK
*/
if ( sk - > sk_state = = TCP_FIN_WAIT2 ) {
struct tcp_sock * tp = tcp_sk ( sk ) ;
2023-08-04 17:46:15 +03:00
if ( READ_ONCE ( tp - > linger2 ) < 0 ) {
2005-04-17 02:20:36 +04:00
tcp_set_state ( sk , TCP_CLOSE ) ;
tcp_send_active_reset ( sk , GFP_ATOMIC ) ;
2016-04-28 02:44:39 +03:00
__NET_INC_STATS ( sock_net ( sk ) ,
2008-07-17 07:31:16 +04:00
LINUX_MIB_TCPABORTONLINGER ) ;
2005-04-17 02:20:36 +04:00
} else {
2005-08-10 07:10:42 +04:00
const int tmo = tcp_fin_time ( sk ) ;
2005-04-17 02:20:36 +04:00
if ( tmo > TCP_TIMEWAIT_LEN ) {
2006-08-01 09:32:09 +04:00
inet_csk_reset_keepalive_timer ( sk ,
tmo - TCP_TIMEWAIT_LEN ) ;
2005-04-17 02:20:36 +04:00
} else {
tcp_time_wait ( sk , TCP_FIN_WAIT2 , tmo ) ;
goto out ;
}
}
}
if ( sk - > sk_state ! = TCP_CLOSE ) {
2012-01-31 02:16:06 +04:00
if ( tcp_check_oom ( sk , 0 ) ) {
2005-04-17 02:20:36 +04:00
tcp_set_state ( sk , TCP_CLOSE ) ;
tcp_send_active_reset ( sk , GFP_ATOMIC ) ;
2016-04-28 02:44:39 +03:00
__NET_INC_STATS ( sock_net ( sk ) ,
2008-07-17 07:31:16 +04:00
LINUX_MIB_TCPABORTONMEMORY ) ;
net: tcp: close sock if net namespace is exiting
When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.
For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open. However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open. In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence. The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:
unregister_netdevice: waiting for lo to become free. Usage count = 1
These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.
After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 00:14:26 +03:00
} else if ( ! check_net ( sock_net ( sk ) ) ) {
/* Not possible to send reset; just close */
tcp_set_state ( sk , TCP_CLOSE ) ;
2005-04-17 02:20:36 +04:00
}
}
2012-08-31 16:29:12 +04:00
if ( sk - > sk_state = = TCP_CLOSE ) {
2019-10-11 06:17:38 +03:00
struct request_sock * req ;
req = rcu_dereference_protected ( tcp_sk ( sk ) - > fastopen_rsk ,
lockdep_sock_is_held ( sk ) ) ;
2012-08-31 16:29:12 +04:00
/* We could get here with a non-NULL req if the socket is
* aborted ( e . g . , closed with unread data ) before 3 WHS
* finishes .
*/
2015-04-03 11:17:27 +03:00
if ( req )
2012-08-31 16:29:12 +04:00
reqsk_fastopen_remove ( sk , req , false ) ;
2005-08-10 07:11:41 +04:00
inet_csk_destroy_sock ( sk ) ;
2012-08-31 16:29:12 +04:00
}
2005-04-17 02:20:36 +04:00
/* Otherwise, socket is reprieved until protocol close. */
out :
bh_unlock_sock ( sk ) ;
local_bh_enable ( ) ;
2020-11-16 12:48:04 +03:00
}
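The first half of __tcp_close() boils down to two small pieces of logic: counting how many payload bytes were thrown away unread, and choosing between disconnect, reset and FIN. The sketch below restates both in user space. It deliberately leaves out the early exits for listening and already-closed sockets, and all `demo_` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Payload length of one queued segment. Sequence numbers are 32-bit and
 * wrap, so the subtraction is done modulo 2^32. A FIN occupies one
 * sequence number but carries no data, hence the decrement.
 */
static uint32_t demo_unread_len(uint32_t seq, uint32_t end_seq, bool has_fin)
{
	uint32_t len = end_seq - seq;

	if (has_fin)
		len--;
	return len;
}

enum demo_close_action {
	DEMO_ACT_DISCONNECT,	/* tear down locally via the disconnect hook */
	DEMO_ACT_RESET,		/* unread data was dropped: abort with RST */
	DEMO_ACT_FIN_IF_NEEDED,	/* orderly close, FIN if the state asks for it */
};

/* Order matters: unread data is checked before the zero-linger case. */
static enum demo_close_action demo_pick_close_action(bool repair,
						     uint32_t unread,
						     bool linger_on,
						     long linger_time)
{
	if (repair)
		return DEMO_ACT_DISCONNECT;
	if (unread)
		return DEMO_ACT_RESET;
	if (linger_on && linger_time == 0)
		return DEMO_ACT_DISCONNECT;
	return DEMO_ACT_FIN_IF_NEEDED;
}
```

A caller would accumulate `demo_unread_len()` over every segment drained from the receive queue and pass the total as `unread`.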
void tcp_close ( struct sock * sk , long timeout )
{
lock_sock ( sk ) ;
__tcp_close ( sk , timeout ) ;
2018-10-02 09:24:26 +03:00
release_sock ( sk ) ;
2005-04-17 02:20:36 +04:00
sock_put ( sk ) ;
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( tcp_close ) ;
2005-04-17 02:20:36 +04:00
/* These states need RST on ABORT according to RFC793 */
2012-05-17 03:15:34 +04:00
static inline bool tcp_need_reset ( int state )
2005-04-17 02:20:36 +04:00
{
return ( 1 < < state ) &
( TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
2021-04-09 20:02:37 +03:00
TCPF_FIN_WAIT2 | TCPF_SYN_RECV ) ;
2005-04-17 02:20:36 +04:00
}
2017-10-06 08:21:27 +03:00
static void tcp_rtx_queue_purge ( struct sock * sk )
{
struct rb_node * p = rb_first ( & sk - > tcp_rtx_queue ) ;
2020-01-23 08:03:00 +03:00
tcp_sk ( sk ) - > highest_sack = NULL ;
2017-10-06 08:21:27 +03:00
while ( p ) {
struct sk_buff * skb = rb_to_skb ( p ) ;
p = rb_next ( p ) ;
/* Since we are deleting whole queue, no need to
* list_del ( & skb - > tcp_tsorted_anchor )
*/
tcp_rtx_queue_unlink ( skb , sk ) ;
2021-10-30 05:05:41 +03:00
tcp_wmem_free_skb ( sk , skb ) ;
2017-10-06 08:21:27 +03:00
}
}
2017-10-06 08:21:22 +03:00
void tcp_write_queue_purge ( struct sock * sk )
{
struct sk_buff * skb ;
tcp_chrono_stop ( sk , TCP_CHRONO_BUSY ) ;
while ( ( skb = __skb_dequeue ( & sk - > sk_write_queue ) ) ! = NULL ) {
tcp_skb_tsorted_anchor_cleanup ( skb ) ;
2021-10-30 05:05:41 +03:00
tcp_wmem_free_skb ( sk , skb ) ;
2017-10-06 08:21:22 +03:00
}
2017-10-06 08:21:27 +03:00
tcp_rtx_queue_purge ( sk ) ;
2017-10-06 08:21:22 +03:00
INIT_LIST_HEAD ( & tcp_sk ( sk ) - > tsorted_sent_queue ) ;
tcp_clear_all_retrans_hints ( tcp_sk ( sk ) ) ;
2018-04-15 03:44:46 +03:00
tcp_sk ( sk ) - > packets_out = 0 ;
2019-02-16 00:36:20 +03:00
inet_csk ( sk ) - > icsk_backoff = 0 ;
2017-10-06 08:21:22 +03:00
}
2005-04-17 02:20:36 +04:00
int tcp_disconnect ( struct sock * sk , int flags )
{
struct inet_sock * inet = inet_sk ( sk ) ;
2005-08-10 07:10:42 +04:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-17 02:20:36 +04:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
int old_state = sk - > sk_state ;
2019-10-11 06:17:41 +03:00
u32 seq ;
2005-04-17 02:20:36 +04:00
if ( old_state ! = TCP_CLOSE )
tcp_set_state ( sk , TCP_CLOSE ) ;
/* ABORT function of RFC793 */
if ( old_state = = TCP_LISTEN ) {
2005-08-10 07:11:41 +04:00
inet_csk_listen_stop ( sk ) ;
2012-04-19 07:40:39 +04:00
} else if ( unlikely ( tp - > repair ) ) {
2023-03-15 23:57:44 +03:00
WRITE_ONCE ( sk - > sk_err , ECONNABORTED ) ;
2005-04-17 02:20:36 +04:00
} else if ( tcp_need_reset ( old_state ) | |
( tp - > snd_nxt ! = tp - > write_seq & &
( 1 < < old_state ) & ( TCPF_CLOSING | TCPF_LAST_ACK ) ) ) {
2005-11-11 04:13:47 +03:00
/* The last check adjusts for discrepancy of Linux wrt. RFC
2005-04-17 02:20:36 +04:00
* states
*/
tcp_send_active_reset ( sk , gfp_any ( ) ) ;
2023-03-15 23:57:44 +03:00
WRITE_ONCE ( sk - > sk_err , ECONNRESET ) ;
2021-04-09 20:02:37 +03:00
} else if ( old_state = = TCP_SYN_SENT )
2023-03-15 23:57:44 +03:00
WRITE_ONCE ( sk - > sk_err , ECONNRESET ) ;
2005-04-17 02:20:36 +04:00
tcp_clear_xmit_timers ( sk ) ;
__skb_queue_purge ( & sk - > sk_receive_queue ) ;
2019-10-11 06:17:40 +03:00
WRITE_ONCE ( tp - > copied_seq , tp - > rcv_nxt ) ;
2021-11-15 22:02:43 +03:00
WRITE_ONCE ( tp - > urg_data , 0 ) ;
2007-03-07 23:12:44 +03:00
tcp_write_queue_purge ( sk ) ;
net/tcp_fastopen: Disable active side TFO in certain scenarios
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X S
[retry and timeout]
https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.
The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST
We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO socket not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().
The rationale behind it is that when such a firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.
Also, add a debug sysctl:
tcp_fastopen_blackhole_detect_timeout_sec:
the initial timeout to use when firewall blackhole issue happens.
This can be set and read.
Setting it to 0 disables the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
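The disable schedule described in this commit message (1h, then 2h, 4h, 8h, ...) is a plain exponential backoff on the number of consecutive blackhole detections. A hedged sketch of that arithmetic follows. The function name, its parameters and the `max_doublings` cap are hypothetical and only illustrate the doubling, the message itself does not state an upper bound.

```c
#include <stdint.h>

/*
 * base_secs:      initial timeout; 0 means the disable logic is off.
 * disable_events: consecutive detections so far (1 for the first one).
 * max_doublings:  hypothetical cap on how often the timeout may double.
 */
static uint64_t demo_tfo_disable_secs(uint32_t base_secs,
				      unsigned int disable_events,
				      unsigned int max_doublings)
{
	unsigned int doublings;

	if (base_secs == 0 || disable_events == 0)
		return 0;

	doublings = disable_events - 1;
	if (doublings > max_doublings)
		doublings = max_doublings;
	if (doublings > 31)	/* keep the 64-bit shift free of overflow */
		doublings = 31;

	return (uint64_t)base_secs << doublings;
}
```

With a base of 3600 seconds the first four events give 3600, 7200, 14400 and 28800 seconds, matching the 1h/2h/4h/8h progression in the text.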
2017-04-21 00:45:46 +03:00
tcp_fastopen_active_disable_ofo_check ( sk ) ;
tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering reaching the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to an RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 00:49:28 +03:00
skb_rbtree_purge ( & tp - > out_of_order_queue ) ;
2005-04-17 02:20:36 +04:00
2009-10-15 10:30:45 +04:00
inet - > inet_dport = 0 ;
2005-04-17 02:20:36 +04:00
2022-11-19 04:49:14 +03:00
inet_bhash2_reset_saddr ( sk ) ;
2005-04-17 02:20:36 +04:00
2023-05-09 23:36:56 +03:00
WRITE_ONCE ( sk - > sk_shutdown , 0 ) ;
2005-04-17 02:20:36 +04:00
sock_reset_flag ( sk , SOCK_DONE ) ;
2014-02-27 02:02:48 +04:00
tp - > srtt_us = 0 ;
2019-01-17 22:23:36 +03:00
tp - > mdev_us = jiffies_to_usecs ( TCP_TIMEOUT_INIT ) ;
2018-06-20 07:42:50 +03:00
tp - > rcv_rtt_last_tsecr = 0 ;
2019-10-11 06:17:41 +03:00
seq = tp - > write_seq + tp - > max_window + 2 ;
if ( ! seq )
seq = 1 ;
WRITE_ONCE ( tp - > write_seq , seq ) ;
2005-08-10 07:10:42 +04:00
icsk - > icsk_backoff = 0 ;
2005-08-10 11:03:31 +04:00
icsk - > icsk_probes_out = 0 ;
2021-01-16 01:30:58 +03:00
icsk - > icsk_probes_tstamp = 0 ;
2019-01-17 22:23:33 +03:00
icsk - > icsk_rto = TCP_TIMEOUT_INIT ;
2020-08-20 22:00:27 +03:00
icsk - > icsk_rto_min = TCP_RTO_MIN ;
2020-08-20 22:00:21 +03:00
icsk - > icsk_delack_max = TCP_DELACK_MAX ;
2009-09-15 12:30:10 +04:00
tp - > snd_ssthresh = TCP_INFINITE_SSTHRESH ;
2022-04-06 02:35:38 +03:00
tcp_snd_cwnd_set ( tp , TCP_INIT_CWND ) ;
2005-04-17 02:20:36 +04:00
tp - > snd_cwnd_cnt = 0 ;
tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited
This commit fixes a bug in the tracking of max_packets_out and
is_cwnd_limited. This bug can cause the connection to fail to remember
that is_cwnd_limited is true, causing the connection to fail to grow
cwnd when it should, causing throughput to be lower than it should be.
The following event sequence is an example that triggers the bug:
(a) The connection is cwnd_limited, but packets_out is not at its
peak due to TSO deferral deciding not to send another skb yet.
In such cases the connection can advance max_packets_seq and set
tp->is_cwnd_limited to true and max_packets_out to a small
number.
(b) Then later in the round trip the connection is pacing-limited (not
cwnd-limited), and packets_out is larger. In such cases the
connection would raise max_packets_out to a bigger number but
(unexpectedly) flip tp->is_cwnd_limited from true to false.
This commit fixes that bug.
One straightforward fix would be to separately track (a) the next
window after max_packets_out reaches a maximum, and (b) the next
window after tp->is_cwnd_limited is set to true. But this would
require consuming an extra u32 sequence number.
Instead, to save space we track only the most important
information. Specifically, we track the strongest available signal of
the degree to which the cwnd is fully utilized:
(1) If the connection is cwnd-limited then we remember that fact for
the current window.
(2) If the connection is not cwnd-limited then we track the maximum
number of outstanding packets in the current window.
In particular, note that the new logic cannot trigger the buggy
(a)/(b) sequence above because with the new logic a condition where
tp->packets_out > tp->max_packets_out can only trigger an update of
tp->is_cwnd_limited if tp->is_cwnd_limited is false.
This first showed up in testing of a BBRv2 dev branch, but this
buggy behavior highlighted a general issue with the
tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
the proper rate for any TCP congestion control, including Reno or
CUBIC.
Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-28 23:03:31 +03:00
tp - > is_cwnd_limited = 0 ;
tp - > max_packets_out = 0 ;
2009-11-30 23:53:30 +03:00
tp - > window_clamp = 0 ;
2020-01-31 21:22:47 +03:00
tp - > delivered = 0 ;
2018-04-18 09:18:48 +03:00
tp - > delivered_ce = 0 ;
2020-07-09 02:18:34 +03:00
if ( icsk - > icsk_ca_ops - > release )
icsk - > icsk_ca_ops - > release ( sk ) ;
memset ( icsk - > icsk_ca_priv , 0 , sizeof ( icsk - > icsk_ca_priv ) ) ;
2020-09-10 22:35:32 +03:00
icsk - > icsk_ca_initialized = 0 ;
2005-08-10 11:03:31 +04:00
tcp_set_ca_state ( sk , TCP_CA_Open ) ;
2017-12-08 00:41:34 +03:00
tp - > is_sack_reneg = 0 ;
2005-04-17 02:20:36 +04:00
tcp_clear_retrans ( tp ) ;
2020-01-31 20:14:47 +03:00
tp - > total_retrans = 0 ;
2005-08-10 07:10:42 +04:00
inet_csk_delack_init ( sk ) ;
2017-05-18 21:22:33 +03:00
/* Initialize rcv_mss to TCP_MIN_MSS to avoid division by 0
* issue in __tcp_select_window ( )
*/
icsk - > icsk_ack . rcv_mss = TCP_MIN_MSS ;
2007-05-04 04:32:28 +04:00
memset ( & tp - > rx_opt , 0 , sizeof ( tp - > rx_opt ) ) ;
2005-04-17 02:20:36 +04:00
__sk_dst_reset ( sk ) ;
inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly
documenting it.
And the following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
do not follow standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation on an RCU-protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.
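The ordering rule (unpublish the pointer first, release the object second) can be illustrated in user space with C11 atomics. This is only a sketch under stated assumptions: struct cached_route, struct holder and route_put() are hypothetical stand-ins for the dst entry, the socket and dst_release(), and the grace-period machinery of real RCU is not modelled.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted route entry. */
struct cached_route {
	int refcnt;
};

/* Hypothetical stand-in for the socket holding the cached pointer. */
struct holder {
	_Atomic(struct cached_route *) rx_route;
};

/* Drop one reference; real code would defer freeing past a grace period. */
static void route_put(struct cached_route *r)
{
	if (r)
		r->refcnt--;
}

/*
 * Detach the pointer and release it in that order: after the exchange
 * no new reader can find the old entry through the holder.
 */
static void holder_clear_route(struct holder *h)
{
	struct cached_route *old =
		atomic_exchange(&h->rx_route, (struct cached_route *)NULL);

	route_put(old);
}
```

Calling holder_clear_route() twice is harmless in this sketch, because the second exchange returns NULL and route_put() ignores it.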
[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
dst_check include/net/dst.h:470 [inline]
tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
</TASK>
Allocated by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
ip_route_input_rcu net/ipv4/route.c:2470 [inline]
ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
__netif_receive_skb_list net/core/dev.c:5608 [inline]
netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
gro_normal_list net/core/dev.c:5853 [inline]
gro_normal_list net/core/dev.c:5849 [inline]
napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
__napi_poll+0xaf/0x440 net/core/dev.c:7023
napi_poll net/core/dev.c:7090 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7177
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
rcu_do_batch kernel/rcu/tree.c:2506 [inline]
rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
__kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
__call_rcu kernel/rcu/tree.c:2985 [inline]
call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
dst_release net/core/dst.c:177 [inline]
dst_release+0x79/0xe0 net/core/dst.c:167
tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
sk_backlog_rcv include/net/sock.h:1030 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2768
release_sock+0x54/0x1b0 net/core/sock.c:3300
tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_write_iter+0x289/0x3c0 net/socket.c:1057
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
dst_alloc+0x146/0x1f0 net/core/dst.c:92
rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
__mkroute_output net/ipv4/route.c:2564 [inline]
ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
__ip_route_output_key include/net/route.h:126 [inline]
ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
ip_route_output_key include/net/route.h:142 [inline]
geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
geneve_xmit_skb drivers/net/geneve.c:899 [inline]
geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
__netdev_start_xmit include/linux/netdevice.h:4994 [inline]
netdev_start_xmit include/linux/netdevice.h:5008 [inline]
xmit_one net/core/dev.c:3590 [inline]
dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
__dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1338 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
free_unref_page_prepare mm/page_alloc.c:3309 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3388
qlink_free mm/kasan/quarantine.c:146 [inline]
qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
kasan_slab_alloc include/linux/kasan.h:259 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3234 [inline]
kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
__alloc_skb+0x215/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1126 [inline]
alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 17:33:30 +03:00
dst_release ( xchg ( ( __force struct dst_entry * * ) & sk - > sk_rx_dst , NULL ) ) ;
tcp: clear saved_syn in tcp_disconnect()
In the (very unlikely) case a passive socket becomes a listener,
we do not want to duplicate its saved SYN headers.
This would lead to double frees and use-after-free, and would please
hackers and various fuzzers.
Tested:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 5) = 0
+0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <...>
+.1 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
+0 connect(4, AF_UNSPEC, ...) = 0
+0 close(3) = 0
+0 bind(4, ..., ...) = 0
+0 listen(4, 5) = 0
+0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <...>
+.1 < . 1:1(0) ack 1 win 257
Fixes: cd8ae85299d5 ("tcp: provide SYN headers for passive connections")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 18:07:33 +03:00
tcp_saved_syn_free ( tp ) ;
tcp: add SACK compression
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.
Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.
This patch adds a high resolution timer and tp->compressed_ack counter.
Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms :
delay = min ( 5 % of RTT, 1 ms)
If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.
When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.
Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.
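The timer arithmetic and the counting behaviour described above can be sketched as follows. The names (struct sack_comp, sack_comp_delay_ns() and friends) are hypothetical, the RTT is assumed to be supplied in nanoseconds, and clearing the counter on cancel is an assumption of this sketch rather than something the text above states.

```c
#include <stdbool.h>
#include <stdint.h>

#define SACK_COMP_CAP_NS 1000000ULL	/* 1 ms cap from the text above */

/* Hypothetical per-connection compression state. */
struct sack_comp {
	bool     timer_armed;
	uint32_t compressed_ack;
};

/* delay = min(5% of RTT, 1 ms), with the RTT given in nanoseconds. */
static uint64_t sack_comp_delay_ns(uint64_t rtt_ns)
{
	uint64_t delay = rtt_ns / 20;

	return delay < SACK_COMP_CAP_NS ? delay : SACK_COMP_CAP_NS;
}

/*
 * A SACK is due. Returns true if the caller should arm the timer,
 * false if the timer is already pending and the SACK was only counted.
 */
static bool sack_comp_defer(struct sack_comp *s)
{
	if (s->timer_armed) {
		s->compressed_ack++;
		return false;
	}
	s->timer_armed = true;
	return true;
}

/* An ACK went out by other means: cancel the pending timer. */
static void sack_comp_cancel(struct sack_comp *s)
{
	s->timer_armed = false;
	s->compressed_ack = 0;	/* assumption of this sketch */
}
```

For a 10 ms RTT the sketch yields a 500 us delay; for a 100 ms RTT the 1 ms cap applies.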
A new SNMP counter is added in the following patch.
Two other patches add sysctls to allow changing the 1,000,000 and 44
values that this commit hard-coded.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 00:47:26 +03:00
tp - > compressed_ack = 0 ;
2020-01-31 21:44:50 +03:00
tp - > segs_in = 0 ;
tp - > segs_out = 0 ;
2018-08-01 03:46:21 +03:00
tp - > bytes_sent = 0 ;
2019-07-07 02:13:07 +03:00
tp - > bytes_acked = 0 ;
tp - > bytes_received = 0 ;
2018-08-01 03:46:22 +03:00
tp - > bytes_retrans = 0 ;
2020-01-31 21:32:41 +03:00
tp - > data_segs_in = 0 ;
tp - > data_segs_out = 0 ;
2018-08-30 00:53:56 +03:00
tp - > duplicate_sack [ 0 ] . start_seq = 0 ;
tp - > duplicate_sack [ 0 ] . end_seq = 0 ;
2018-08-01 03:46:23 +03:00
tp - > dsack_dups = 0 ;
2018-08-01 03:46:24 +03:00
tp - > reord_seen = 0 ;
2019-01-17 22:23:39 +03:00
tp - > retrans_out = 0 ;
tp - > sacked_out = 0 ;
tp - > tlp_high_seq = 0 ;
tp - > last_oow_ack_time = 0 ;
2022-10-26 16:51:14 +03:00
tp - > plb_rehash = 0 ;
2019-01-17 22:23:40 +03:00
/* There's a bubble in the pipe until at least the first ACK. */
tp - > app_limited = ~ 0U ;
2023-01-19 22:00:28 +03:00
tp - > rate_app_limited = 1 ;
2019-01-17 22:23:41 +03:00
tp - > rack . mstamp = 0 ;
tp - > rack . advanced = 0 ;
tp - > rack . reo_wnd_steps = 1 ;
tp - > rack . last_delivered = 0 ;
tp - > rack . reo_wnd_persist = 0 ;
tp - > rack . dsack_seen = 0 ;
2019-01-17 22:23:42 +03:00
tp - > syn_data_acked = 0 ;
tp - > rx_opt . saw_tstamp = 0 ;
tp - > rx_opt . dsack = 0 ;
tp - > rx_opt . num_sacks = 0 ;
2019-09-14 02:23:34 +03:00
tp - > rcv_ooopack = 0 ;
2019-01-17 22:23:40 +03:00
2005-04-17 02:20:36 +04:00
2017-03-02 00:29:48 +03:00
/* Clean up fastopen related fields */
tcp_free_fastopen_req ( tp ) ;
2023-08-16 11:15:45 +03:00
inet_clear_bit ( DEFER_CONNECT , sk ) ;
tcp: add TCP_INFO status for failed client TFO
The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether
or not data-in-SYN was ack'd on both the client and server side. We'd like
to gather more information on the client-side in the failure case in order
to indicate the reason for the failure. This can be useful for not only
debugging TFO, but also for creating TFO socket policies. For example, if
a middle box removes the TFO option or drops a data-in-SYN, we
can detect this case and turn off TFO for these connections, saving the
extra retransmits.
The newly added tcpi_fastopen_client_fail status is 2 bits and has the
following 4 states:
1) TFO_STATUS_UNSPEC
Catch-all state which includes when TFO is disabled via black hole
detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE.
2) TFO_COOKIE_UNAVAILABLE
If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie
is available in the cache.
3) TFO_DATA_NOT_ACKED
Data was sent with SYN, we received a SYN/ACK but it did not cover the data
portion. Cookie is not accepted by server because the cookie may be invalid
or the server may be overloaded.
4) TFO_SYN_RETRANSMITTED
Data was sent with SYN, we received a SYN/ACK which did not cover the data
after at least 1 additional SYN was sent (without data). It may be the case
that a middle-box is dropping data-in-SYN packets. Thus, it would be more
efficient to not use TFO on this connection to avoid extra retransmits
during connection establishment.
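The four states fit in a 2-bit field and could be consumed by a client-side policy roughly as sketched below. The numeric values are assumed to follow the listing order above, and tfo_should_disable() is a hypothetical policy helper, not part of the kernel or of the TCP_INFO interface.

```c
#include <stdbool.h>

/* Values assumed to follow the order in which the states are listed. */
enum tfo_client_fail {
	TFO_STATUS_UNSPEC = 0,		/* catch-all */
	TFO_COOKIE_UNAVAILABLE = 1,	/* no cookie in the cache */
	TFO_DATA_NOT_ACKED = 2,		/* SYN/ACK did not cover the data */
	TFO_SYN_RETRANSMITTED = 3,	/* data not covered after a SYN retransmit */
};

/*
 * Hypothetical policy: stop using TFO towards a peer once a SYN had to
 * be retransmitted, since a middle box may be dropping data-in-SYN.
 * Only the low 2 bits of the status are significant.
 */
static bool tfo_should_disable(unsigned int status)
{
	return (status & 0x3) == TFO_SYN_RETRANSMITTED;
}
```

An application would read the status from the tcpi_fastopen_client_fail field after connect() completes and feed it to such a helper; whether TFO_DATA_NOT_ACKED should also disable TFO is a policy choice this sketch leaves open.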
These new fields do not cover all the cases where TFO may fail, but other
failures, such as SYN/ACK + data being dropped, will result in the
connection not becoming established. And a connection blackhole after
session establishment shows up as a stalled connection.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-23 18:09:26 +03:00
tp - > fastopen_client_fail = 0 ;
2017-03-02 00:29:48 +03:00
2009-10-15 10:30:45 +04:00
WARN_ON ( inet - > inet_num & & ! icsk - > icsk_bind_hash ) ;
2005-04-17 02:20:36 +04:00
2018-01-26 11:40:41 +03:00
if ( sk - > sk_frag . page ) {
put_page ( sk - > sk_frag . page ) ;
sk - > sk_frag . page = NULL ;
sk - > sk_frag . offset = 0 ;
}
2021-06-28 01:48:21 +03:00
sk_error_report ( sk ) ;
2018-08-03 11:28:48 +03:00
return 0 ;
2005-04-17 02:20:36 +04:00
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( tcp_disconnect ) ;
2005-04-17 02:20:36 +04:00
2012-05-17 03:15:34 +04:00
static inline bool tcp_can_repair_sock ( const struct sock * sk )
2012-04-19 07:40:39 +04:00
{
2022-08-17 09:17:30 +03:00
return sockopt_ns_capable ( sock_net ( sk ) - > user_ns , CAP_NET_ADMIN ) & &
2016-11-15 05:15:14 +03:00
( sk - > sk_state ! = TCP_LISTEN ) ;
2012-04-19 07:40:39 +04:00
}
2020-07-23 09:09:06 +03:00
static int tcp_repair_set_window ( struct tcp_sock * tp , sockptr_t optbuf , int len )
2016-06-28 01:33:56 +03:00
{
struct tcp_repair_window opt ;
if ( ! tp - > repair )
return - EPERM ;
if ( len ! = sizeof ( opt ) )
return - EINVAL ;
2020-07-23 09:09:06 +03:00
if ( copy_from_sockptr ( & opt , optbuf , sizeof ( opt ) ) )
2016-06-28 01:33:56 +03:00
return - EFAULT ;
if ( opt . max_window < opt . snd_wnd )
return - EINVAL ;
if ( after ( opt . snd_wl1 , tp - > rcv_nxt + opt . rcv_wnd ) )
return - EINVAL ;
if ( after ( opt . rcv_wup , tp - > rcv_nxt ) )
return - EINVAL ;
tp - > snd_wl1 = opt . snd_wl1 ;
tp - > snd_wnd = opt . snd_wnd ;
tp - > max_window = opt . max_window ;
tp - > rcv_wnd = opt . rcv_wnd ;
tp - > rcv_wup = opt . rcv_wup ;
return 0 ;
}
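The three sanity checks in tcp_repair_set_window() rely on wraparound-safe sequence comparison. A user-space sketch of the same checks is shown below; seq_after(), struct repair_window and repair_window_valid() are hypothetical names that mirror the kernel logic above, not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if sequence number a is later than b, modulo 2^32. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Hypothetical mirror of the fields carried by struct tcp_repair_window. */
struct repair_window {
	uint32_t snd_wl1;
	uint32_t snd_wnd;
	uint32_t max_window;
	uint32_t rcv_wnd;
	uint32_t rcv_wup;
};

/* Same three checks as above, returning a bool instead of -EINVAL. */
static bool repair_window_valid(const struct repair_window *w,
				uint32_t rcv_nxt)
{
	if (w->max_window < w->snd_wnd)
		return false;
	if (seq_after(w->snd_wl1, rcv_nxt + w->rcv_wnd))
		return false;
	if (seq_after(w->rcv_wup, rcv_nxt))
		return false;
	return true;
}
```

Because the comparison is done on the signed difference, a small sequence number just past the 32-bit wrap still counts as later than a large one just before it.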
2020-07-23 09:09:06 +03:00
static int tcp_repair_options_est ( struct sock * sk , sockptr_t optbuf ,
unsigned int len )
tcp: Repair connection-time negotiated parameters
Some options are set up on a socket during the TCP handshake and
need to be restored on the socket while repairing.
A new sockoption accepts a buffer and parses it. The buffer should
be a CODE:VALUE sequence of bytes, where CODE is the standard option
code and VALUE is the respective value.
Only 4 options should be handled on a repaired socket.
To read 3 out of 4 of these options the TCP_INFO sockoption can be
used. The ability to get the last one (the mss_clamp) was added by
the previous patch.
Now the restore. Three of these options -- timestamp_ok, mss_clamp
and snd_wscale -- are just restored on a socket.
The sack_ok flag has 2 issues. First, whether or not to do sacks
at all. This flag is just read and set back. No other sack info is
saved or restored, since according to the standard and the code
dropping all sack-ed segments is OK: the sender will resubmit them,
so after the repair we will probably experience a pause in the
connection. Next, the fack bit. It's just set back on a socket if
the respective sysctl is set. No collected stats about packet flow
are preserved. As far as I see (please correct me if I'm wrong) the
fack-based congestion algorithm survives dropping all of the stats
and repairs itself eventually, probably losing performance for
that period.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-19 07:41:57 +04:00
{
2017-05-26 20:28:00 +03:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
2012-04-26 03:43:04 +04:00
struct tcp_repair_opt opt ;
2020-07-28 19:38:35 +03:00
size_t offset = 0 ;
2012-04-19 07:41:57 +04:00
2012-04-26 03:43:04 +04:00
while ( len > = sizeof ( opt ) ) {
2020-07-28 19:38:35 +03:00
if ( copy_from_sockptr_offset ( & opt , optbuf , offset , sizeof ( opt ) ) )
2012-04-19 07:41:57 +04:00
return - EFAULT ;
2020-07-28 19:38:35 +03:00
offset + = sizeof ( opt ) ;
2012-04-26 03:43:04 +04:00
len - = sizeof ( opt ) ;
2012-04-19 07:41:57 +04:00
2012-04-26 03:43:04 +04:00
switch ( opt . opt_code ) {
case TCPOPT_MSS :
tp - > rx_opt . mss_clamp = opt . opt_val ;
2017-05-26 20:28:00 +03:00
tcp_mtup_init ( sk ) ;
2012-04-19 07:41:57 +04:00
break ;
2012-04-26 03:43:04 +04:00
case TCPOPT_WINDOW :
2012-09-19 13:40:00 +04:00
{
u16 snd_wscale = opt . opt_val & 0xFFFF ;
u16 rcv_wscale = opt . opt_val > > 16 ;
2017-04-04 16:09:48 +03:00
if ( snd_wscale > TCP_MAX_WSCALE | | rcv_wscale > TCP_MAX_WSCALE )
2012-09-19 13:40:00 +04:00
return - EFBIG ;
2012-04-19 07:41:57 +04:00
2012-09-19 13:40:00 +04:00
tp - > rx_opt . snd_wscale = snd_wscale ;
tp - > rx_opt . rcv_wscale = rcv_wscale ;
tp - > rx_opt . wscale_ok = 1 ;
}
2012-04-19 07:41:57 +04:00
break ;
case TCPOPT_SACK_PERM :
2012-04-26 03:43:04 +04:00
if ( opt . opt_val ! = 0 )
return - EINVAL ;
2012-04-19 07:41:57 +04:00
			tp->rx_opt.sack_ok |= TCP_SACK_SEEN;
			break;
		case TCPOPT_TIMESTAMP:
2012-04-26 03:43:04 +04:00
			if (opt.opt_val != 0)
				return -EINVAL;
2012-04-19 07:41:57 +04:00
			tp->rx_opt.tstamp_ok = 1;
			break;
		}
	}
	return 0;
}
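The loop that ends above walks the CODE:VALUE pairs supplied through TCP_REPAIR_OPTIONS. As a rough user-space illustration of the two checks visible in this excerpt, the sketch below walks such an array. `struct repair_opt`, `struct restored_opts` and `apply_repair_opts()` are hypothetical names introduced here, the option codes are the standard TCP option kinds, and only the SACK-permitted and timestamp cases are modeled; other options are simply skipped.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Standard TCP option kinds (IANA-assigned values). */
#define OPT_SACK_PERM 4
#define OPT_TIMESTAMP 8

/* Hypothetical mirror of one CODE:VALUE pair. */
struct repair_opt {
	uint32_t code;
	uint32_t val;
};

/* Hypothetical container for the flags this sketch restores. */
struct restored_opts {
	int sack_ok;
	int tstamp_ok;
};

/* Returns 0 on success, or -EINVAL when a flag-only option carries a value. */
int apply_repair_opts(const struct repair_opt *opts, size_t n,
		      struct restored_opts *out)
{
	for (size_t i = 0; i < n; i++) {
		switch (opts[i].code) {
		case OPT_SACK_PERM:
			if (opts[i].val != 0)
				return -EINVAL;
			out->sack_ok = 1;
			break;
		case OPT_TIMESTAMP:
			if (opts[i].val != 0)
				return -EINVAL;
			out->tstamp_ok = 1;
			break;
		default:
			/* Options not shown in this excerpt are skipped here. */
			break;
		}
	}
	return 0;
}
```

Both options are pure flags, which is why a non-zero value is treated as a malformed request rather than silently ignored.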
tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.
Linux offers netem module, but it has impractical constraints:
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.
EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.
Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.
This requires the FQ packet scheduler or an EDT-enabled NIC.
This patch adds the TCP_TX_DELAY socket option, to set a delay in
usec units.
unsigned int tx_delay = 10000; /* 10 msec */
setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
Note that FQ packet scheduler limits might need some tweaking :
man tc-fq
PARAMETERS
limit
Hard limit on the real queue size. When this limit is
reached, new packets are dropped. If the value is lowered,
packets are dropped so that the new limit is met. Default
is 10000 packets.
flow_limit
Hard limit on the maximum number of packets queued per
flow. Default value is 100.
Use of the TCP_TX_DELAY option will increase the number of skbs in the FQ qdisc,
so packets would be dropped if either of the previous limits is hit.
Use of a jump label makes this support runtime-free, for hosts
never using the option.
Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch: we need to account for the fact that artificially delayed
skbs won't stop us from providing more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ cannot use this trick).
Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 21:57:25 +03:00
DEFINE_STATIC_KEY_FALSE(tcp_tx_delay_enabled);
EXPORT_SYMBOL(tcp_tx_delay_enabled);
static void tcp_enable_tx_delay(void)
{
	if (!static_branch_unlikely(&tcp_tx_delay_enabled)) {
		static int __tcp_tx_delay_enabled = 0;
		if (cmpxchg(&__tcp_tx_delay_enabled, 0, 1) == 0) {
			static_branch_enable(&tcp_tx_delay_enabled);
			pr_info("TCP_TX_DELAY enabled\n");
		}
	}
}
2020-05-28 08:12:18 +03:00
/* When set indicates to always queue non-full frames. Later the user clears
 * this option and we transmit any pending partial frames in the queue. This is
 * meant to be used alongside sendfile() to get properly filled frames when the
 * user (for example) must write out headers with a write() call first and then
 * use sendfile to send out the data parts.
 *
 * TCP_CORK can be set together with TCP_NODELAY and it is stronger than
 * TCP_NODELAY.
 */
2021-12-04 01:35:39 +03:00
void __tcp_sock_set_cork(struct sock *sk, bool on)
2020-05-28 08:12:18 +03:00
{
	struct tcp_sock *tp = tcp_sk(sk);
	if (on) {
		tp->nonagle |= TCP_NAGLE_CORK;
	} else {
		tp->nonagle &= ~TCP_NAGLE_CORK;
		if (tp->nonagle & TCP_NAGLE_OFF)
			tp->nonagle |= TCP_NAGLE_PUSH;
		tcp_push_pending_frames(sk);
	}
}
void tcp_sock_set_cork(struct sock *sk, bool on)
{
	lock_sock(sk);
	__tcp_sock_set_cork(sk, on);
	release_sock(sk);
}
EXPORT_SYMBOL(tcp_sock_set_cork);
2020-05-28 08:12:19 +03:00
/* TCP_NODELAY is weaker than TCP_CORK, so that this option on corked socket is
 * remembered, but it is not activated until cork is cleared.
 *
 * However, when TCP_NODELAY is set we make an explicit push, which overrides
 * even TCP_CORK for currently queued segments.
 */
2021-12-04 01:35:39 +03:00
void __tcp_sock_set_nodelay(struct sock *sk, bool on)
2020-05-28 08:12:19 +03:00
{
	if (on) {
		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF | TCP_NAGLE_PUSH;
		tcp_push_pending_frames(sk);
	} else {
		tcp_sk(sk)->nonagle &= ~TCP_NAGLE_OFF;
	}
}
void tcp_sock_set_nodelay(struct sock *sk)
{
	lock_sock(sk);
	__tcp_sock_set_nodelay(sk, true);
	release_sock(sk);
}
EXPORT_SYMBOL(tcp_sock_set_nodelay);
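The interaction between TCP_CORK and TCP_NODELAY described in the two comments above can be followed on the flag bits alone. The sketch below is a user-space model, not kernel code: `model_set_cork()` and `model_set_nodelay()` are hypothetical names, the bit values are assumed to mirror the kernel's TCP_NAGLE_* flags, and the side effect of pushing pending frames is left out.

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's TCP_NAGLE_* bit values. */
#define NAGLE_OFF  1u	/* Nagle disabled (TCP_NODELAY set) */
#define NAGLE_CORK 2u	/* socket is corked */
#define NAGLE_PUSH 4u	/* an explicit push was requested */

/* Model of the cork setter: returns the new flag word. */
unsigned int model_set_cork(unsigned int nonagle, bool on)
{
	if (on)
		return nonagle | NAGLE_CORK;
	nonagle &= ~NAGLE_CORK;
	/* A remembered TCP_NODELAY becomes active once the cork is cleared. */
	if (nonagle & NAGLE_OFF)
		nonagle |= NAGLE_PUSH;
	return nonagle;
}

/* Model of the nodelay setter: returns the new flag word. */
unsigned int model_set_nodelay(unsigned int nonagle, bool on)
{
	if (on)
		return nonagle | NAGLE_OFF | NAGLE_PUSH;
	return nonagle & ~NAGLE_OFF;
}
```

Setting TCP_NODELAY on a corked socket leaves the cork bit in place, which is the sense in which TCP_CORK is the stronger of the two options.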
2020-05-28 08:12:20 +03:00
static void __tcp_sock_set_quickack(struct sock *sk, int val)
{
	if (!val) {
		inet_csk_enter_pingpong_mode(sk);
		return;
	}
	inet_csk_exit_pingpong_mode(sk);
	if ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
	    inet_csk_ack_scheduled(sk)) {
		inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_PUSHED;
		tcp_cleanup_rbuf(sk, 1);
		if (!(val & 1))
			inet_csk_enter_pingpong_mode(sk);
	}
}
void tcp_sock_set_quickack(struct sock *sk, int val)
{
	lock_sock(sk);
	__tcp_sock_set_quickack(sk, val);
	release_sock(sk);
}
EXPORT_SYMBOL(tcp_sock_set_quickack);
2020-05-28 08:12:21 +03:00
int tcp_sock_set_syncnt(struct sock *sk, int val)
{
	if (val < 1 || val > MAX_TCP_SYNCNT)
		return -EINVAL;
2023-07-20 00:28:52 +03:00
	WRITE_ONCE(inet_csk(sk)->icsk_syn_retries, val);
2020-05-28 08:12:21 +03:00
	return 0;
}
EXPORT_SYMBOL(tcp_sock_set_syncnt);
2023-08-04 17:46:12 +03:00
int tcp_sock_set_user_timeout(struct sock *sk, int val)
2020-05-28 08:12:22 +03:00
{
2023-08-04 17:46:12 +03:00
	/* Cap the max time in ms TCP will retry or probe the window
	 * before giving up and aborting (ETIMEDOUT) a connection.
	 */
	if (val < 0)
		return -EINVAL;
2023-07-20 00:28:56 +03:00
	WRITE_ONCE(inet_csk(sk)->icsk_user_timeout, val);
2023-08-04 17:46:12 +03:00
	return 0;
2020-05-28 08:12:22 +03:00
}
EXPORT_SYMBOL(tcp_sock_set_user_timeout);
2020-06-20 18:30:51 +03:00
int tcp_sock_set_keepidle_locked(struct sock *sk, int val)
2020-05-28 08:12:23 +03:00
{
	struct tcp_sock *tp = tcp_sk(sk);
	if (val < 1 || val > MAX_TCP_KEEPIDLE)
		return -EINVAL;
2023-07-20 00:28:49 +03:00
	/* Paired with READ_ONCE() in keepalive_time_when() */
	WRITE_ONCE(tp->keepalive_time, val * HZ);
2020-05-28 08:12:23 +03:00
	if (sock_flag(sk, SOCK_KEEPOPEN) &&
	    !((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) {
		u32 elapsed = keepalive_time_elapsed(tp);
		if (tp->keepalive_time > elapsed)
			elapsed = tp->keepalive_time - elapsed;
		else
			elapsed = 0;
		inet_csk_reset_keepalive_timer(sk, elapsed);
	}
	return 0;
}
int tcp_sock_set_keepidle(struct sock *sk, int val)
{
	int err;
	lock_sock(sk);
2020-06-20 18:30:51 +03:00
	err = tcp_sock_set_keepidle_locked(sk, val);
2020-05-28 08:12:23 +03:00
	release_sock(sk);
	return err;
}
EXPORT_SYMBOL(tcp_sock_set_keepidle);
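The arithmetic in the keepidle setter above has two parts: a range-checked conversion from seconds to ticks, and a clamped subtraction that decides how long the keepalive timer still has to run. The sketch below restates both in user space. `keepidle_to_ticks()` and `keepalive_delay()` are hypothetical helpers, and the tick rate and upper bound are passed as parameters because the kernel's HZ and MAX_TCP_KEEPIDLE values are not shown in this excerpt.

```c
#include <errno.h>
#include <stdint.h>

/* Convert an idle time in seconds to ticks, rejecting values outside
 * [1, max_secs]. hz and max_secs are parameters of this sketch only.
 */
int keepidle_to_ticks(int val, int hz, int max_secs, uint32_t *ticks)
{
	if (val < 1 || val > max_secs)
		return -EINVAL;
	*ticks = (uint32_t)val * (uint32_t)hz;
	return 0;
}

/* Ticks left before the first probe: the configured idle time minus the
 * time already spent idle, clamped at zero so the timer fires at once
 * when the connection has already been idle for long enough.
 */
uint32_t keepalive_delay(uint32_t keepalive_time, uint32_t elapsed)
{
	return keepalive_time > elapsed ? keepalive_time - elapsed : 0;
}
```

The clamp matters because both quantities are unsigned: an unchecked subtraction would wrap to a very large delay instead of zero.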
2020-05-28 08:12:24 +03:00
int tcp_sock_set_keepintvl(struct sock *sk, int val)
{
	if (val < 1 || val > MAX_TCP_KEEPINTVL)
		return -EINVAL;
2023-07-20 00:28:50 +03:00
	WRITE_ONCE(tcp_sk(sk)->keepalive_intvl, val * HZ);
2020-05-28 08:12:24 +03:00
	return 0;
}
EXPORT_SYMBOL(tcp_sock_set_keepintvl);
2020-05-28 08:12:25 +03:00
int tcp_sock_set_keepcnt(struct sock *sk, int val)
{
	if (val < 1 || val > MAX_TCP_KEEPCNT)
		return -EINVAL;
2023-07-20 00:28:51 +03:00
	/* Paired with READ_ONCE() in keepalive_probes() */
	WRITE_ONCE(tcp_sk(sk)->keepalive_probes, val);
2020-05-28 08:12:25 +03:00
	return 0;
}
EXPORT_SYMBOL(tcp_sock_set_keepcnt);
2020-12-03 00:31:51 +03:00
int tcp_set_window_clamp(struct sock *sk, int val)
{
	struct tcp_sock *tp = tcp_sk(sk);
	if (!val) {
		if (sk->sk_state != TCP_CLOSE)
			return -EINVAL;
		tp->window_clamp = 0;
	} else {
		tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
			SOCK_MIN_RCVBUF / 2 : val;
2021-08-26 00:01:17 +03:00
		tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
2020-12-03 00:31:51 +03:00
	}
	return 0;
}
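The clamping rules in the function above can be exercised outside the kernel with a small model. In the sketch below, `struct clamp_state` and `model_set_window_clamp()` are hypothetical, the minimum receive buffer size is passed in as a parameter because SOCK_MIN_RCVBUF depends on the kernel build, and `val` is assumed to be non-negative.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket fields the function touches. */
struct clamp_state {
	bool closed;			/* stands in for sk_state == TCP_CLOSE */
	unsigned int rcv_wnd;
	unsigned int window_clamp;
	unsigned int rcv_ssthresh;
};

/* Model of the window clamp setter; min_rcvbuf replaces SOCK_MIN_RCVBUF. */
int model_set_window_clamp(struct clamp_state *s, int val, int min_rcvbuf)
{
	if (!val) {
		/* Removing the clamp is only allowed on a closed socket. */
		if (!s->closed)
			return -EINVAL;
		s->window_clamp = 0;
		return 0;
	}
	/* Small values are raised to half the minimum receive buffer. */
	s->window_clamp = val < min_rcvbuf / 2 ? min_rcvbuf / 2 : val;
	/* The receive slow-start threshold never exceeds the new clamp. */
	s->rcv_ssthresh = s->rcv_wnd < s->window_clamp ?
			  s->rcv_wnd : s->window_clamp;
	return 0;
}
```

With an illustrative `min_rcvbuf` of 2304, a request for 100 bytes ends up as a clamp of 1152, while a request for 100000 is stored unchanged and only `rcv_ssthresh` is limited by the current receive window.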
2005-04-17 02:20:36 +04:00
/*
 * Socket option code for TCP.
 */
2022-08-17 09:18:19 +03:00
int do_tcp_setsockopt(struct sock *sk, int level, int optname,
		      sockptr_t optval, unsigned int optlen)
2005-04-17 02:20:36 +04:00
{
	struct tcp_sock *tp = tcp_sk(sk);
2005-08-10 07:10:42 +04:00
	struct inet_connection_sock *icsk = inet_csk(sk);
2016-02-03 10:46:56 +03:00
	struct net *net = sock_net(sk);
2005-04-17 02:20:36 +04:00
	int val;
	int err = 0;
2009-12-02 21:19:30 +03:00
	/* These are data/string values, all the others are ints */
	switch (optname) {
	case TCP_CONGESTION: {
2005-06-24 07:37:36 +04:00
		char name[TCP_CA_NAME_MAX];
		if (optlen < 1)
			return -EINVAL;
2020-07-23 09:09:06 +03:00
		val = strncpy_from_sockptr(name, optval,
2009-10-02 02:02:20 +04:00
					min_t(long, TCP_CA_NAME_MAX - 1, optlen));
2005-06-24 07:37:36 +04:00
		if (val < 0)
			return -EFAULT;
		name[val] = 0;
2022-08-17 09:17:30 +03:00
		sockopt_lock_sock(sk);
2022-08-31 02:19:46 +03:00
		err = tcp_set_congestion_control(sk, name, !has_current_bpf_ctx(),
2022-08-17 09:17:30 +03:00
						 sockopt_ns_capable(sock_net(sk)->user_ns,
								    CAP_NET_ADMIN));
		sockopt_release_sock(sk);
2005-06-24 07:37:36 +04:00
		return err;
	}
2017-06-14 21:37:14 +03:00
	case TCP_ULP: {
		char name[TCP_ULP_NAME_MAX];
		if (optlen < 1)
			return -EINVAL;
2020-07-23 09:09:06 +03:00
		val = strncpy_from_sockptr(name, optval,
2017-06-14 21:37:14 +03:00
					min_t(long, TCP_ULP_NAME_MAX - 1,
					      optlen));
		if (val < 0)
			return -EFAULT;
		name[val] = 0;
2022-08-17 09:17:30 +03:00
		sockopt_lock_sock(sk);
2017-06-14 21:37:14 +03:00
		err = tcp_set_ulp(sk, name);
2022-08-17 09:17:30 +03:00
		sockopt_release_sock(sk);
2017-06-14 21:37:14 +03:00
		return err;
	}
2017-10-18 21:22:51 +03:00
	case TCP_FASTOPEN_KEY: {
2019-05-29 19:33:58 +03:00
		__u8 key[TCP_FASTOPEN_KEY_BUF_LENGTH];
		__u8 *backup_key = NULL;
2017-10-18 21:22:51 +03:00
2019-05-29 19:33:58 +03:00
		/* Allow a backup key as well to facilitate key rotation
		 * First key is the active one.
		 */
		if (optlen != TCP_FASTOPEN_KEY_LENGTH &&
		    optlen != TCP_FASTOPEN_KEY_BUF_LENGTH)
2017-10-18 21:22:51 +03:00
			return -EINVAL;
2020-07-23 09:09:06 +03:00
		if (copy_from_sockptr(key, optval, optlen))
2017-10-18 21:22:51 +03:00
			return -EFAULT;
2019-05-29 19:33:58 +03:00
		if (optlen == TCP_FASTOPEN_KEY_BUF_LENGTH)
			backup_key = key + TCP_FASTOPEN_KEY_LENGTH;
2019-06-20 00:46:28 +03:00
		return tcp_fastopen_reset_cipher(net, sk, key, backup_key);
2017-10-18 21:22:51 +03:00
	}
2009-12-02 21:19:30 +03:00
	default:
		/* fallthru */
		break;
2010-05-14 14:58:26 +04:00
	}
2005-06-24 07:37:36 +04:00
2005-04-17 02:20:36 +04:00
	if (optlen < sizeof(int))
		return -EINVAL;
2020-07-23 09:09:06 +03:00
	if (copy_from_sockptr(&val, optval, sizeof(val)))
2005-04-17 02:20:36 +04:00
		return -EFAULT;
2023-08-04 17:46:11 +03:00
	/* Handle options that can be set without locking the socket. */
	switch (optname) {
	case TCP_SYNCNT:
		return tcp_sock_set_syncnt(sk, val);
2023-08-04 17:46:12 +03:00
	case TCP_USER_TIMEOUT:
		return tcp_sock_set_user_timeout(sk, val);
2023-08-04 17:46:13 +03:00
	case TCP_KEEPINTVL:
		return tcp_sock_set_keepintvl(sk, val);
2023-08-04 17:46:14 +03:00
	case TCP_KEEPCNT:
		return tcp_sock_set_keepcnt(sk, val);
2023-08-04 17:46:15 +03:00
	case TCP_LINGER2:
		if (val < 0)
			WRITE_ONCE(tp->linger2, -1);
		else if (val > TCP_FIN_TIMEOUT_MAX / HZ)
			WRITE_ONCE(tp->linger2, TCP_FIN_TIMEOUT_MAX);
		else
			WRITE_ONCE(tp->linger2, val * HZ);
		return 0;
2023-08-04 17:46:16 +03:00
	case TCP_DEFER_ACCEPT:
		/* Translate value in seconds to number of retransmits */
		WRITE_ONCE(icsk->icsk_accept_queue.rskq_defer_accept,
			   secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
					   TCP_RTO_MAX / HZ));
		return 0;
2023-08-04 17:46:11 +03:00
	}
2022-08-17 09:17:30 +03:00
	sockopt_lock_sock(sk);
2005-04-17 02:20:36 +04:00
	switch (optname) {
	case TCP_MAXSEG:
		/* Values greater than interface MTU won't take effect. However
		 * at the point when this call is done we typically don't yet
2017-05-22 09:29:15 +03:00
		 * know which interface is going to be used
		 */
2017-03-21 04:28:03 +03:00
		if (val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW)) {
2005-04-17 02:20:36 +04:00
			err = -EINVAL;
			break;
		}
		tp->rx_opt.user_mss = val;
		break;
	case TCP_NODELAY:
2020-05-28 08:12:19 +03:00
		__tcp_sock_set_nodelay(sk, val);
2005-04-17 02:20:36 +04:00
		break;
2010-02-18 05:47:01 +03:00
	case TCP_THIN_LINEAR_TIMEOUTS:
		if (val < 0 || val > 1)
			err = -EINVAL;
		else
			tp->thin_lto = val;
		break;
2010-02-18 07:48:19 +03:00
	case TCP_THIN_DUPACK:
		if (val < 0 || val > 1)
			err = -EINVAL;
		break;
2012-04-19 07:40:39 +04:00
	case TCP_REPAIR:
		if (!tcp_can_repair_sock(sk))
			err = -EPERM;
2018-07-15 18:36:37 +03:00
		else if (val == TCP_REPAIR_ON) {
2012-04-19 07:40:39 +04:00
			tp->repair = 1;
			sk->sk_reuse = SK_FORCE_REUSE;
			tp->repair_queue = TCP_NO_QUEUE;
2018-07-15 18:36:37 +03:00
		} else if (val == TCP_REPAIR_OFF) {
2012-04-19 07:40:39 +04:00
			tp->repair = 0;
			sk->sk_reuse = SK_NO_REUSE;
			tcp_send_window_probe(sk);
2018-07-15 18:36:37 +03:00
		} else if (val == TCP_REPAIR_OFF_NO_WP) {
2012-04-19 07:40:39 +04:00
			tp->repair = 0;
			sk->sk_reuse = SK_NO_REUSE;
		} else
			err = -EINVAL;
		break;
	case TCP_REPAIR_QUEUE:
		if (!tp->repair)
			err = -EPERM;
tcp: fix TCP_REPAIR_QUEUE bound checking
syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out()
with the following C repro:
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
writev(3, [{"\270", 1}], 1) = 1
setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144
The 3rd system call looks odd:
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
This patch makes sure bound checking is using an unsigned compare.
Fixes: ee9952831cfd ("tcp: Initial repair mode")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
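The effect of the unsigned compare described in this message can be reproduced in isolation. The sketch below contrasts a signed bound check with the cast form used in the code that follows. `QUEUES_NR` is a hypothetical stand-in for TCP_QUEUES_NR, whose value is not shown in this excerpt, and both helper names are illustrative.

```c
#include <stdbool.h>

/* Hypothetical queue count standing in for TCP_QUEUES_NR. */
#define QUEUES_NR 3

/* Signed check: -1 is below the bound, so it is wrongly accepted. */
bool queue_id_ok_signed(int val)
{
	return val < QUEUES_NR;
}

/* Unsigned check: -1 converts to UINT_MAX and is rejected. */
bool queue_id_ok_unsigned(int val)
{
	return (unsigned int)val < QUEUES_NR;
}
```

A single cast covers both ends of the range, so no separate `val >= 0` test is needed.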
2018-04-30 04:55:20 +03:00
		else if ((unsigned int)val < TCP_QUEUES_NR)
2012-04-19 07:40:39 +04:00
			tp->repair_queue = val;
		else
			err = -EINVAL;
		break;
	case TCP_QUEUE_SEQ:
tcp: add sanity tests to TCP_QUEUE_SEQ
Qingyu Li reported a syzkaller bug where the repro
changes RCV SEQ _after_ restoring data in the receive queue.
mprotect(0x4aa000, 12288, PROT_READ) = 0
mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
recvfrom(3, NULL, 20, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
syslog shows:
[ 111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0
[ 111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0
This should not be allowed. TCP_QUEUE_SEQ should only be used
when queues are empty.
This patch fixes this case, and the tx path as well.
Fixes: ee9952831cfd ("tcp: Initial repair mode")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005
Reported-by: Qingyu Li <ieatmuttonchuan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
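The rule stated in this message, that TCP_QUEUE_SEQ may only be used while the queues are empty, can be written down as a small decision function. The model below follows the checks in the code that comes next. `struct seq_state`, `enum repair_queue` and `model_set_queue_seq()` are hypothetical, and the booleans stand in for kernel state that a user-space program cannot inspect directly.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum repair_queue { NO_QUEUE, RECV_QUEUE, SEND_QUEUE };

/* Hypothetical stand-in for the socket state consulted by the checks. */
struct seq_state {
	bool closed;			/* stands in for sk_state == TCP_CLOSE */
	enum repair_queue queue;	/* queue chosen via TCP_REPAIR_QUEUE */
	bool rtx_queue_empty;
	uint32_t write_seq;
	uint32_t rcv_nxt;
	uint32_t copied_seq;
};

/* Model of the TCP_QUEUE_SEQ checks: 0 on success, negative errno otherwise. */
int model_set_queue_seq(struct seq_state *s, uint32_t val)
{
	if (!s->closed)
		return -EPERM;
	if (s->queue == SEND_QUEUE) {
		/* Data already queued for transmission pins write_seq. */
		if (!s->rtx_queue_empty)
			return -EPERM;
		s->write_seq = val;
		return 0;
	}
	if (s->queue == RECV_QUEUE) {
		/* Unread data means rcv_nxt and copied_seq differ. */
		if (s->rcv_nxt != s->copied_seq)
			return -EPERM;
		s->rcv_nxt = val;
		s->copied_seq = val;
		return 0;
	}
	return -EINVAL;
}
```

In the reported sequence the receive queue already held 20 bytes, so `rcv_nxt` and `copied_seq` differed and a check of this form refuses the request.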
2021-03-01 21:29:17 +03:00
		if (sk->sk_state != TCP_CLOSE) {
2012-04-19 07:40:39 +04:00
			err = -EPERM;
2021-03-01 21:29:17 +03:00
		} else if (tp->repair_queue == TCP_SEND_QUEUE) {
			if (!tcp_rtx_queue_empty(sk))
				err = -EPERM;
			else
				WRITE_ONCE(tp->write_seq, val);
		} else if (tp->repair_queue == TCP_RECV_QUEUE) {
			if (tp->rcv_nxt != tp->copied_seq) {
				err = -EPERM;
			} else {
				WRITE_ONCE(tp->rcv_nxt, val);
				WRITE_ONCE(tp->copied_seq, val);
			}
		} else {
2012-04-19 07:40:39 +04:00
			err = -EINVAL;
2021-03-01 21:29:17 +03:00
		}
2012-04-19 07:40:39 +04:00
		break;
2012-04-19 07:41:57 +04:00
	case TCP_REPAIR_OPTIONS:
		if (!tp->repair)
			err = -EINVAL;
2022-11-04 05:27:23 +03:00
		else if (sk->sk_state == TCP_ESTABLISHED && !tp->bytes_sent)
2020-07-23 09:09:06 +03:00
			err = tcp_repair_options_est(sk, optval, optlen);
2012-04-19 07:41:57 +04:00
		else
			err = -EPERM;
		break;
2005-04-17 02:20:36 +04:00
	case TCP_CORK:
2020-05-28 08:12:18 +03:00
		__tcp_sock_set_cork(sk, val);
2005-04-17 02:20:36 +04:00
		break;
	case TCP_KEEPIDLE:
2020-06-20 18:30:51 +03:00
		err = tcp_sock_set_keepidle_locked(sk, val);
2005-04-17 02:20:36 +04:00
		break;
2015-05-04 07:34:46 +03:00
	case TCP_SAVE_SYN:
2020-08-20 22:01:23 +03:00
		/* 0: disable, 1: enable, 2: start from ether_header */
		if (val < 0 || val > 2)
2015-05-04 07:34:46 +03:00
			err = -EINVAL;
		else
			tp->save_syn = val;
		break;
2005-04-17 02:20:36 +04:00
	case TCP_WINDOW_CLAMP:
2020-12-03 00:31:51 +03:00
		err = tcp_set_window_clamp(sk, val);
2005-04-17 02:20:36 +04:00
		break;
	case TCP_QUICKACK:
2020-05-28 08:12:20 +03:00
		__tcp_sock_set_quickack(sk, val);
2005-04-17 02:20:36 +04:00
		break;
2006-11-15 06:07:45 +03:00
#ifdef CONFIG_TCP_MD5SIG
	case TCP_MD5SIG:
2017-06-16 04:07:07 +03:00
	case TCP_MD5SIG_EXT:
2020-07-23 09:09:06 +03:00
		err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
2006-11-15 06:07:45 +03:00
		break;
#endif
2012-08-31 16:29:12 +04:00
	case TCP_FASTOPEN:
		if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
tcp: Do not call tcp_fastopen_reset_cipher from interrupt context
tcp_fastopen_reset_cipher really cannot be called from interrupt
context. It allocates the tcp_fastopen_context with GFP_KERNEL and
calls crypto_alloc_cipher, which allocates all kinds of stuff with
GFP_KERNEL.
Thus, we might sleep when the key-generation is triggered by an
incoming TFO cookie-request which would then happen in interrupt
context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:
[ 36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
[ 36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
[ 36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
[ 36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 36.008250] 00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
[ 36.009630] ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
[ 36.011076] 0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
[ 36.012494] Call Trace:
[ 36.012953] <IRQ> [<ffffffff8171d53a>] dump_stack+0x4f/0x6d
[ 36.014085] [<ffffffff810967d3>] ___might_sleep+0x103/0x170
[ 36.015117] [<ffffffff81096892>] __might_sleep+0x52/0x90
[ 36.016117] [<ffffffff8118e887>] kmem_cache_alloc_trace+0x47/0x190
[ 36.017266] [<ffffffff81680d82>] ? tcp_fastopen_reset_cipher+0x42/0x130
[ 36.018485] [<ffffffff81680d82>] tcp_fastopen_reset_cipher+0x42/0x130
[ 36.019679] [<ffffffff81680f01>] tcp_fastopen_init_key_once+0x61/0x70
[ 36.020884] [<ffffffff81680f2c>] __tcp_fastopen_cookie_gen+0x1c/0x60
[ 36.022058] [<ffffffff816814ff>] tcp_try_fastopen+0x58f/0x730
[ 36.023118] [<ffffffff81671788>] tcp_conn_request+0x3e8/0x7b0
[ 36.024185] [<ffffffff810e3872>] ? __module_text_address+0x12/0x60
[ 36.025327] [<ffffffff8167b2e1>] tcp_v4_conn_request+0x51/0x60
[ 36.026410] [<ffffffff816727e0>] tcp_rcv_state_process+0x190/0xda0
[ 36.027556] [<ffffffff81661f97>] ? __inet_lookup_established+0x47/0x170
[ 36.028784] [<ffffffff8167c2ad>] tcp_v4_do_rcv+0x16d/0x3d0
[ 36.029832] [<ffffffff812e6806>] ? security_sock_rcv_skb+0x16/0x20
[ 36.030936] [<ffffffff8167cc8a>] tcp_v4_rcv+0x77a/0x7b0
[ 36.031875] [<ffffffff816af8c3>] ? iptable_filter_hook+0x33/0x70
[ 36.032953] [<ffffffff81657d22>] ip_local_deliver_finish+0x92/0x1f0
[ 36.034065] [<ffffffff81657f1a>] ip_local_deliver+0x9a/0xb0
[ 36.035069] [<ffffffff81657c90>] ? ip_rcv+0x3d0/0x3d0
[ 36.035963] [<ffffffff81657569>] ip_rcv_finish+0x119/0x330
[ 36.036950] [<ffffffff81657ba7>] ip_rcv+0x2e7/0x3d0
[ 36.037847] [<ffffffff81610652>] __netif_receive_skb_core+0x552/0x930
[ 36.038994] [<ffffffff81610a57>] __netif_receive_skb+0x27/0x70
[ 36.040033] [<ffffffff81610b72>] process_backlog+0xd2/0x1f0
[ 36.041025] [<ffffffff81611482>] net_rx_action+0x122/0x310
[ 36.042007] [<ffffffff81076743>] __do_softirq+0x103/0x2f0
[ 36.042978] [<ffffffff81723e3c>] do_softirq_own_stack+0x1c/0x30
This patch moves the call to tcp_fastopen_init_key_once to the places
where a listener socket creates its TFO-state, which always happens in
user-context (either from the setsockopt, or implicitly during the
listen()-call)
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Fixes: 222e83d2e0ae ("tcp: switch tcp_fastopen key generation to net_get_random_once")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-18 19:15:34 +03:00
		    TCPF_LISTEN))) {
2017-09-27 06:35:42 +03:00
			tcp_fastopen_init_key_once(net);
2015-06-18 19:15:34 +03:00
2015-09-29 17:42:52 +03:00
			fastopen_queue_tune(sk, val);
2015-06-18 19:15:34 +03:00
		} else {
2012-08-31 16:29:12 +04:00
			err = -EINVAL;
2015-06-18 19:15:34 +03:00
		}
2012-08-31 16:29:12 +04:00
		break;
net/tcp-fastopen: Add new API support
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences is not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created, without changing the rest of the socket call sequence:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return EOPNOTSUPP if TFO is not supported or not enabled in the
    kernel.
  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.
  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.
  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
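The one extra setsockopt() call described above could look roughly like this from user space. This is a minimal sketch, assuming a Linux host; `set_tfo_connect` is a hypothetical helper name, and the fallback define of 30 is only used when the libc headers predate the option.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallback for older libc headers; 30 is the value in the Linux UAPI header. */
#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30
#endif

/*
 * Hypothetical helper: turn Fast Open on (1) or off (0) for a later connect().
 * It has to be called while the socket is still closed, i.e. before connect().
 * Returns 0 on success or a negative errno value.
 */
static int set_tfo_connect(int fd, int enable)
{
	/* The kernel rejects any other value with EINVAL, so catch it early. */
	assert(enable == 0 || enable == 1);

	if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
		       &enable, sizeof(enable)) < 0)
		return -errno;
	return 0;
}
```

After a successful call, the application keeps its usual connect() and write() sequence. A return value of -EOPNOTSUPP indicates that client-side Fast Open is not enabled in the tcp_fastopen sysctl.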
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 21:59:22 +03:00
	case TCP_FASTOPEN_CONNECT:
		if (val > 1 || val < 0) {
			err = -EINVAL;
2022-07-15 20:17:54 +03:00
		} else if (READ_ONCE(net->ipv4.sysctl_tcp_fastopen) &
			   TFO_CLIENT_ENABLE) {
2017-01-23 21:59:22 +03:00
			if (sk->sk_state == TCP_CLOSE)
				tp->fastopen_connect = val;
			else
				err = -EINVAL;
		} else {
			err = -EOPNOTSUPP;
		}
		break;
2017-10-23 23:22:23 +03:00
	case TCP_FASTOPEN_NO_COOKIE:
		if (val > 1 || val < 0)
			err = -EINVAL;
		else if (!((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)))
			err = -EINVAL;
		else
			tp->fastopen_no_cookie = val;
		break;
2013-02-11 09:50:18 +04:00
	case TCP_TIMESTAMP:
2023-10-20 15:57:47 +03:00
		if (!tp->repair) {
2013-02-11 09:50:18 +04:00
			err = -EPERM;
2023-10-20 15:57:47 +03:00
			break;
		}
		/* val is an opaque field,
		 * and low order bit contains usec_ts enable bit.
		 * It's a best effort, and we do not care if user makes an error.
		 */
		tp->tcp_usec_ts = val & 1;
		WRITE_ONCE(tp->tsoffset, val - tcp_clock_ts(tp->tcp_usec_ts));
2013-02-11 09:50:18 +04:00
		break;
2016-06-28 01:33:56 +03:00
	case TCP_REPAIR_WINDOW:
		err = tcp_repair_set_window(tp, optval, optlen);
		break;
tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.
TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :
Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)
For most workloads, a value of 128 KB or less gives applications
enough time to react to POLLOUT events
(or to be woken up in a blocking sendmsg())
This patch adds two ways to set the limit :
1) Per socket option TCP_NOTSENT_LOWAT
2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->notsent_lowat
Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol)
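The per-socket variant of the limit could be applied from user space along these lines. This is a minimal sketch, assuming a Linux host; `set_notsent_lowat` is a hypothetical helper name, and the fallback define of 25 is only used when the libc headers lack the option.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallback for older libc headers; 25 is the value in the Linux UAPI header. */
#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25
#endif

/*
 * Hypothetical helper: cap the number of unsent bytes queued on one socket.
 * Returns 0 on success or a negative errno value.
 */
static int set_notsent_lowat(int fd, int bytes)
{
	/* The kernel keeps the limit in an unsigned field; refuse negatives. */
	assert(bytes >= 0);

	if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
		       &bytes, sizeof(bytes)) < 0)
		return -errno;
	return 0;
}
```

For example, `set_notsent_lowat(fd, 128 * 1024)` would apply the 128 KB figure mentioned above to a single socket, while a value of 0 leaves the socket on the sysctl setting.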
Tested:
netperf sessions, and watching /proc/net/protocols "memory" column for TCP
With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.
A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
412,514 context-switches
200.034645535 seconds time elapsed
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
2,675,818 context-switches
200.029651391 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 07:27:07 +04:00
	case TCP_NOTSENT_LOWAT:
2023-07-20 00:28:55 +03:00
		WRITE_ONCE(tp->notsent_lowat, val);
2013-07-23 07:27:07 +04:00
		sk->sk_write_space(sk);
		break;
tcp: send in-queue bytes in cmsg upon read
Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.
The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.
Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.
Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.
With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
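On the receiving side, an application has to pick the hint out of the control messages returned by recvmsg(). The following is a minimal sketch of that step, assuming the option was enabled earlier with setsockopt(fd, IPPROTO_TCP, TCP_INQ, ...) and that the hint is delivered as a single int; `extract_inq_hint` is a hypothetical helper name, and the fallback defines are only used when the libc headers lack the option.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers; 36 is the value in the Linux UAPI header. */
#ifndef TCP_INQ
#define TCP_INQ 36
#endif
#ifndef TCP_CM_INQ
#define TCP_CM_INQ TCP_INQ
#endif

/*
 * Hypothetical helper: scan the control messages filled in by recvmsg() and
 * copy out the TCP_CM_INQ value if one is present.
 * Returns 1 if a hint was found, 0 otherwise (*inq is left untouched).
 */
static int extract_inq_hint(struct msghdr *msg, int *inq)
{
	struct cmsghdr *cm;

	assert(msg != NULL && inq != NULL);

	for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == IPPROTO_TCP &&
		    cm->cmsg_type == TCP_CM_INQ &&
		    cm->cmsg_len >= CMSG_LEN(sizeof(int))) {
			/* memcpy avoids alignment assumptions on CMSG_DATA. */
			memcpy(inq, CMSG_DATA(cm), sizeof(int));
			return 1;
		}
	}
	return 0;
}
```

Since the value is only a hint, a caller would typically use it to size the next read buffer and still handle short reads normally.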
V3 change-log:
As suggested by David Miller, added loads with barrier
to check whether we have multiple threads calling recvmsg
in parallel. When that happens we lock the socket to
calculate inq.
V4 change-log:
Removed inline from a static function.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 22:39:15 +03:00
	case TCP_INQ:
		if (val > 1 || val < 0)
			err = -EINVAL;
		else
			tp->recvmsg_inq = val;
		break;
tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.
Linux offers the netem module, but it has impractical constraints:
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.
EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.
Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.
This requires the FQ packet scheduler or an EDT-enabled NIC.
This patch adds the TCP_TX_DELAY socket option, to set a delay in
usec units.
unsigned int tx_delay = 10000; /* 10 msec */
setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
Note that FQ packet scheduler limits might need some tweaking :
man tc-fq
PARAMETERS
limit
Hard limit on the real queue size. When this limit is
reached, new packets are dropped. If the value is lowered,
packets are dropped so that the new limit is met. Default
is 10000 packets.
flow_limit
Hard limit on the maximum number of packets queued per
flow. Default value is 100.
Use of the TCP_TX_DELAY option will increase the number of skbs in the FQ qdisc,
so packets would be dropped if either of the limits above is hit.
Use of a jump label makes this support runtime-free, for hosts
never using the option.
Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch: we need to account that skbs artificially delayed
won't stop us providing more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ cannot use this trick)
Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 21:57:25 +03:00
	case TCP_TX_DELAY:
		if (val)
			tcp_enable_tx_delay();
2023-07-20 00:28:47 +03:00
		WRITE_ONCE(tp->tcp_tx_delay, val);
2019-06-12 21:57:25 +03:00
		break;
2005-04-17 02:20:36 +04:00
	default:
		err = -ENOPROTOOPT;
		break;
2007-04-21 04:09:22 +04:00
	}
2022-08-17 09:17:30 +03:00
	sockopt_release_sock(sk);
2005-04-17 02:20:36 +04:00
	return err;
}
2020-07-23 09:09:07 +03:00
int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
2009-10-01 03:12:20 +04:00
		   unsigned int optlen)
2006-03-21 09:45:21 +03:00
{
2011-10-21 13:22:42 +04:00
	const struct inet_connection_sock *icsk = inet_csk(sk);
2006-03-21 09:45:21 +03:00
	if (level != SOL_TCP)
2022-10-06 21:53:49 +03:00
		/* Paired with WRITE_ONCE() in do_ipv6_setsockopt() and tcp_v6_connect() */
		return READ_ONCE(icsk->icsk_af_ops)->setsockopt(sk, level, optname,
								optval, optlen);
2020-07-23 09:09:07 +03:00
	return do_tcp_setsockopt(sk, level, optname, optval, optlen);
2006-03-21 09:45:21 +03:00
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL(tcp_setsockopt);
2006-03-21 09:45:21 +03:00
2016-11-28 10:07:17 +03:00
static void tcp_get_info_chrono_stats(const struct tcp_sock *tp,
				      struct tcp_info *info)
{
	u64 stats[__TCP_CHRONO_MAX], total = 0;
	enum tcp_chrono i;
	for (i = TCP_CHRONO_BUSY; i < __TCP_CHRONO_MAX; ++i) {
		stats[i] = tp->chrono_stat[i - 1];
		if (i == tp->chrono_type)
2017-05-17 00:00:09 +03:00
			stats[i] += tcp_jiffies32 - tp->chrono_start;
2016-11-28 10:07:17 +03:00
		stats[i] *= USEC_PER_SEC / HZ;
		total += stats[i];
	}
	info->tcpi_busy_time = total;
	info->tcpi_rwnd_limited = stats[TCP_CHRONO_RWND_LIMITED];
	info->tcpi_sndbuf_limited = stats[TCP_CHRONO_SNDBUF_LIMITED];
}
2005-04-17 02:20:36 +04:00
/* Return information about state of tcp endpoint in API format. */
2015-04-29 01:28:17 +03:00
void tcp_get_info(struct sock *sk, struct tcp_info *info)
2005-04-17 02:20:36 +04:00
{
2015-06-15 18:26:20 +03:00
	const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
2005-08-10 07:10:42 +04:00
	const struct inet_connection_sock *icsk = inet_csk(sk);
net: extend sk_pacing_rate to unsigned long
sk_pacing_rate was introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so the iproute2/ss command requires no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
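The 34Gbit figure follows from simple arithmetic, shown here as a small sketch. It assumes the field counts bytes per second, which is consistent with the number quoted above; `max_pacing_bits_per_sec` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest rate, in bits per second, that a field of the given width can
 * express when the field itself counts bytes per second.
 */
static uint64_t max_pacing_bits_per_sec(unsigned int field_bits)
{
	uint64_t max_bytes_per_sec;

	/* Keep the multiplication by 8 below from overflowing 64 bits. */
	assert(field_bits >= 1 && field_bits <= 60);

	max_bytes_per_sec = (UINT64_C(1) << field_bits) - 1;
	return max_bytes_per_sec * 8;
}
```

For a 32-bit field this gives (2^32 - 1) * 8 = 34359738360 bits per second, or roughly 34.4 Gbit per second.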
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
segs_in:3916318 data_segs_out:177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
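The 34Gbit figure and the widening expression used in the code can be checked with a small userspace sketch. It assumes the pacing rate is counted in bytes per second and that an all-ones value means "no limit"; the sketch_ helpers are hypothetical names, and uint32_t stands in for unsigned long on a 32-bit kernel.

```c
#include <stdint.h>

/* Widen a 32-bit rate to 64 bits. An all-ones input is treated as the
 * "no limit" sentinel and must stay all-ones after widening, which is
 * what the (rate != ~0UL) ? rate : ~0ULL expression achieves.
 */
static uint64_t sketch_widen_rate32(uint32_t rate)
{
	return (rate != UINT32_MAX) ? rate : UINT64_MAX;
}

/* Upper bound of a u32 bytes-per-second field, expressed in bits/s. */
static uint64_t sketch_u32_max_bits_per_sec(void)
{
	return (uint64_t)UINT32_MAX * 8u;
}
```

The second helper returns 34359738360, i.e. roughly 34.36 Gbit/s, which matches the limit quoted in the commit message.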
2018-10-15 19:37:53 +03:00
unsigned long rate ;
2017-07-28 20:28:20 +03:00
u32 now ;
2016-01-27 21:52:43 +03:00
u64 rate64 ;
2016-11-04 21:54:32 +03:00
bool slow ;
2005-04-17 02:20:36 +04:00
memset ( info , 0 , sizeof ( * info ) ) ;
2015-06-15 18:26:20 +03:00
if ( sk - > sk_type ! = SOCK_STREAM )
return ;
2005-04-17 02:20:36 +04:00
2017-12-20 06:12:52 +03:00
info - > tcpi_state = inet_sk_state_load ( sk ) ;
2015-11-12 19:43:18 +03:00
2016-11-04 21:54:31 +03:00
/* Report meaningful fields for all TCP states, including listeners */
rate = READ_ONCE ( sk - > sk_pacing_rate ) ;
2018-10-15 19:37:53 +03:00
rate64 = ( rate ! = ~ 0UL ) ? rate : ~ 0ULL ;
2016-11-09 22:24:22 +03:00
info - > tcpi_pacing_rate = rate64 ;
2016-11-04 21:54:31 +03:00
rate = READ_ONCE ( sk - > sk_max_pacing_rate ) ;
2018-10-15 19:37:53 +03:00
rate64 = ( rate ! = ~ 0UL ) ? rate : ~ 0ULL ;
2016-11-09 22:24:22 +03:00
info - > tcpi_max_pacing_rate = rate64 ;
2016-11-04 21:54:31 +03:00
info - > tcpi_reordering = tp - > reordering ;
2022-04-06 02:35:38 +03:00
info - > tcpi_snd_cwnd = tcp_snd_cwnd ( tp ) ;
2016-11-04 21:54:31 +03:00
if ( info - > tcpi_state = = TCP_LISTEN ) {
/* listeners aliased fields :
* tcpi_unacked - > Number of children ready for accept ( )
* tcpi_sacked - > max backlog
*/
2019-11-06 01:11:53 +03:00
info - > tcpi_unacked = READ_ONCE ( sk - > sk_ack_backlog ) ;
2019-11-06 01:11:54 +03:00
info - > tcpi_sacked = READ_ONCE ( sk - > sk_max_ack_backlog ) ;
2016-11-04 21:54:31 +03:00
return ;
}
2017-01-09 21:29:27 +03:00
slow = lock_sock_fast ( sk ) ;
2005-08-10 11:03:31 +04:00
info - > tcpi_ca_state = icsk - > icsk_ca_state ;
2005-08-10 07:10:42 +04:00
info - > tcpi_retransmits = icsk - > icsk_retransmits ;
2005-08-10 11:03:31 +04:00
info - > tcpi_probes = icsk - > icsk_probes_out ;
2005-08-10 07:10:42 +04:00
info - > tcpi_backoff = icsk - > icsk_backoff ;
2005-04-17 02:20:36 +04:00
if ( tp - > rx_opt . tstamp_ok )
info - > tcpi_options | = TCPI_OPT_TIMESTAMPS ;
2007-08-09 16:14:46 +04:00
if ( tcp_is_sack ( tp ) )
2005-04-17 02:20:36 +04:00
info - > tcpi_options | = TCPI_OPT_SACK ;
if ( tp - > rx_opt . wscale_ok ) {
info - > tcpi_options | = TCPI_OPT_WSCALE ;
info - > tcpi_snd_wscale = tp - > rx_opt . snd_wscale ;
info - > tcpi_rcv_wscale = tp - > rx_opt . rcv_wscale ;
2007-02-09 17:24:47 +03:00
}
2005-04-17 02:20:36 +04:00
2011-10-03 22:01:21 +04:00
if ( tp - > ecn_flags & TCP_ECN_OK )
2005-04-17 02:20:36 +04:00
info - > tcpi_options | = TCPI_OPT_ECN ;
2011-10-03 22:01:21 +04:00
if ( tp - > ecn_flags & TCP_ECN_SEEN )
info - > tcpi_options | = TCPI_OPT_ECN_SEEN ;
2012-10-19 19:14:44 +04:00
if ( tp - > syn_data_acked )
info - > tcpi_options | = TCPI_OPT_SYN_DATA ;
2023-10-20 15:57:48 +03:00
if ( tp - > tcp_usec_ts )
info - > tcpi_options | = TCPI_OPT_USEC_TS ;
2005-04-17 02:20:36 +04:00
2005-08-10 07:10:42 +04:00
info - > tcpi_rto = jiffies_to_usecs ( icsk - > icsk_rto ) ;
2023-10-06 04:18:40 +03:00
info - > tcpi_ato = jiffies_to_usecs ( min_t ( u32 , icsk - > icsk_ack . ato ,
tcp_delack_max ( sk ) ) ) ;
2005-07-06 02:24:38 +04:00
info - > tcpi_snd_mss = tp - > mss_cache ;
2005-08-10 07:10:42 +04:00
info - > tcpi_rcv_mss = icsk - > icsk_ack . rcv_mss ;
2005-04-17 02:20:36 +04:00
2016-11-04 21:54:31 +03:00
info - > tcpi_unacked = tp - > packets_out ;
info - > tcpi_sacked = tp - > sacked_out ;
2005-04-17 02:20:36 +04:00
info - > tcpi_lost = tp - > lost_out ;
info - > tcpi_retrans = tp - > retrans_out ;
2017-05-17 00:00:03 +03:00
now = tcp_jiffies32 ;
2005-04-17 02:20:36 +04:00
info - > tcpi_last_data_sent = jiffies_to_msecs ( now - tp - > lsndtime ) ;
2005-08-10 07:10:42 +04:00
info - > tcpi_last_data_recv = jiffies_to_msecs ( now - icsk - > icsk_ack . lrcvtime ) ;
2005-04-17 02:20:36 +04:00
info - > tcpi_last_ack_recv = jiffies_to_msecs ( now - tp - > rcv_tstamp ) ;
2005-12-14 10:26:10 +03:00
info - > tcpi_pmtu = icsk - > icsk_pmtu_cookie ;
2005-04-17 02:20:36 +04:00
info - > tcpi_rcv_ssthresh = tp - > rcv_ssthresh ;
2014-02-27 02:02:48 +04:00
info - > tcpi_rtt = tp - > srtt_us > > 3 ;
info - > tcpi_rttvar = tp - > mdev_us > > 2 ;
2005-04-17 02:20:36 +04:00
info - > tcpi_snd_ssthresh = tp - > snd_ssthresh ;
info - > tcpi_advmss = tp - > advmss ;
2017-04-25 20:15:41 +03:00
info - > tcpi_rcv_rtt = tp - > rcv_rtt_est . rtt_us > > 3 ;
2005-04-17 02:20:36 +04:00
info - > tcpi_rcv_space = tp - > rcvq_space . space ;
info - > tcpi_total_retrans = tp - > total_retrans ;
tcp: add pacing_rate information into tcp_info
Add two new fields to struct tcp_info, to report sk_pacing_rate
and sk_max_pacing_rate to monitoring applications, such as ss from iproute2.
User exported fields are 64bit, even if kernel is currently using 32bit
fields.
lpaa5:~# ss -i
..
skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic
wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send
13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15
rcv_space:29200
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 02:27:40 +04:00
2016-11-09 22:24:22 +03:00
info - > tcpi_bytes_acked = tp - > bytes_acked ;
info - > tcpi_bytes_received = tp - > bytes_received ;
2016-11-04 21:54:32 +03:00
info - > tcpi_notsent_bytes = max_t ( int , 0 , tp - > write_seq - tp - > snd_nxt ) ;
2016-11-28 10:07:17 +03:00
tcp_get_info_chrono_stats ( tp , info ) ;
2016-11-04 21:54:32 +03:00
2015-05-21 02:35:41 +03:00
info - > tcpi_segs_out = tp - > segs_out ;
2021-11-15 22:02:42 +03:00
/* segs_in and data_segs_in can be updated from tcp_segs_in() from BH */
info - > tcpi_segs_in = READ_ONCE ( tp - > segs_in ) ;
info - > tcpi_data_segs_in = READ_ONCE ( tp - > data_segs_in ) ;
2016-02-12 09:02:53 +03:00
info - > tcpi_min_rtt = tcp_min_rtt ( tp ) ;
2016-03-14 20:52:15 +03:00
info - > tcpi_data_segs_out = tp - > data_segs_out ;
2016-09-20 06:39:16 +03:00
info - > tcpi_delivery_rate_app_limited = tp - > rate_app_limited ? 1 : 0 ;
2017-07-28 20:28:20 +03:00
rate64 = tcp_compute_delivery_rate ( tp ) ;
if ( rate64 )
2016-11-09 22:24:22 +03:00
info - > tcpi_delivery_rate = rate64 ;
2018-04-18 09:18:49 +03:00
info - > tcpi_delivered = tp - > delivered ;
info - > tcpi_delivered_ce = tp - > delivered_ce ;
2018-08-01 03:46:21 +03:00
info - > tcpi_bytes_sent = tp - > bytes_sent ;
2018-08-01 03:46:22 +03:00
info - > tcpi_bytes_retrans = tp - > bytes_retrans ;
2018-08-01 03:46:23 +03:00
info - > tcpi_dsack_dups = tp - > dsack_dups ;
2018-08-01 03:46:24 +03:00
info - > tcpi_reord_seen = tp - > reord_seen ;
2019-09-14 02:23:34 +03:00
info - > tcpi_rcv_ooopack = tp - > rcv_ooopack ;
2019-09-14 02:23:35 +03:00
info - > tcpi_snd_wnd = tp - > snd_wnd ;
2022-10-26 16:51:15 +03:00
info - > tcpi_rcv_wnd = tp - > rcv_wnd ;
info - > tcpi_rehash = tp - > plb_rehash + tp - > timeout_rehash ;
tcp: add TCP_INFO status for failed client TFO
The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether
or not data-in-SYN was ack'd on both the client and server side. We'd like
to gather more information on the client-side in the failure case in order
to indicate the reason for the failure. This can be useful for not only
debugging TFO, but also for creating TFO socket policies. For example, if
a middle box removes the TFO option or drops a data-in-SYN, we can
detect this case, and turn off TFO for these connections saving the
extra retransmits.
The newly added tcpi_fastopen_client_fail status is 2 bits and has the
following 4 states:
1) TFO_STATUS_UNSPEC
Catch-all state which includes when TFO is disabled via black hole
detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE.
2) TFO_COOKIE_UNAVAILABLE
If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie
is available in the cache.
3) TFO_DATA_NOT_ACKED
Data was sent with SYN, we received a SYN/ACK but it did not cover the data
portion. Cookie is not accepted by server because the cookie may be invalid
or the server may be overloaded.
4) TFO_SYN_RETRANSMITTED
Data was sent with SYN, we received a SYN/ACK which did not cover the data
after at least 1 additional SYN was sent (without data). It may be the case
that a middle-box is dropping data-in-SYN packets. Thus, it would be more
efficient to not use TFO on this connection to avoid extra retransmits
during connection establishment.
These new fields do not cover all the cases where TFO may fail, but other
failures, such as SYN/ACK + data being dropped, will result in the
connection not becoming established. And a connection blackhole after
session establishment shows up as a stalled connection.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
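The four states listed above could drive a client-side TFO policy along these lines. This is a sketch only: the enum is a hypothetical mirror of the states in the order the message lists them (assumed to be numbered 0 to 3), and the policy of disabling TFO only after a retransmitted SYN is one possible choice, not something the commit prescribes.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four states, in the order listed above. */
enum sketch_tfo_status {
	SKETCH_TFO_STATUS_UNSPEC,
	SKETCH_TFO_COOKIE_UNAVAILABLE,
	SKETCH_TFO_DATA_NOT_ACKED,
	SKETCH_TFO_SYN_RETRANSMITTED
};

/* Human-readable name for a 2-bit status value. */
static const char *sketch_tfo_status_name(unsigned int status)
{
	switch (status) {
	case SKETCH_TFO_STATUS_UNSPEC:
		return "TFO_STATUS_UNSPEC";
	case SKETCH_TFO_COOKIE_UNAVAILABLE:
		return "TFO_COOKIE_UNAVAILABLE";
	case SKETCH_TFO_DATA_NOT_ACKED:
		return "TFO_DATA_NOT_ACKED";
	case SKETCH_TFO_SYN_RETRANSMITTED:
		return "TFO_SYN_RETRANSMITTED";
	default:
		return "unknown";
	}
}

/* Example policy (an assumption): a retransmitted SYN hints at a middlebox
 * dropping data-in-SYN, so avoid TFO for that peer. A missing cookie or a
 * rejected cookie is treated as transient.
 */
static bool sketch_should_disable_tfo(unsigned int status)
{
	return status == SKETCH_TFO_SYN_RETRANSMITTED;
}
```

An application would feed these helpers the tcpi_fastopen_client_fail value obtained through TCP_INFO after connect() completes.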
2019-10-23 18:09:26 +03:00
info - > tcpi_fastopen_client_fail = tp - > fastopen_client_fail ;
2023-09-14 17:36:21 +03:00
info - > tcpi_total_rto = tp - > total_rto ;
info - > tcpi_total_rto_recoveries = tp - > total_rto_recoveries ;
info - > tcpi_total_rto_time = tp - > total_rto_time ;
2023-10-20 15:57:39 +03:00
if ( tp - > rto_stamp )
info - > tcpi_total_rto_time + = tcp_clock_ms ( ) - tp - > rto_stamp ;
2023-09-14 17:36:21 +03:00
2017-01-09 21:29:27 +03:00
unlock_sock_fast ( sk , slow ) ;
2005-04-17 02:20:36 +04:00
}
EXPORT_SYMBOL_GPL ( tcp_get_info ) ;
2018-08-01 03:46:20 +03:00
static size_t tcp_opt_stats_get_size ( void )
{
return
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_BUSY */
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_RWND_LIMITED */
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_SNDBUF_LIMITED */
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_DATA_SEGS_OUT */
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_TOTAL_RETRANS */
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_PACING_RATE */
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_DELIVERY_RATE */
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_SND_CWND */
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_REORDERING */
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_MIN_RTT */
nla_total_size ( sizeof ( u8 ) ) + /* TCP_NLA_RECUR_RETRANS */
nla_total_size ( sizeof ( u8 ) ) + /* TCP_NLA_DELIVERY_RATE_APP_LMT */
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_SNDQ_SIZE */
nla_total_size ( sizeof ( u8 ) ) + /* TCP_NLA_CA_STATE */
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_SND_SSTHRESH */
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_DELIVERED */
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_DELIVERED_CE */
2018-08-01 03:46:21 +03:00
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_BYTES_SENT */
2018-08-01 03:46:22 +03:00
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_BYTES_RETRANS */
2018-08-01 03:46:23 +03:00
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_DSACK_DUPS */
2018-08-01 03:46:24 +03:00
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_REORD_SEEN */
2018-11-16 03:44:12 +03:00
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_SRTT */
2020-01-25 00:34:02 +03:00
nla_total_size ( sizeof ( u16 ) ) + /* TCP_NLA_TIMEOUT_REHASH */
2020-03-09 23:16:40 +03:00
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_BYTES_NOTSENT */
2020-07-31 01:44:40 +03:00
nla_total_size_64bit ( sizeof ( u64 ) ) + /* TCP_NLA_EDT */
2021-01-20 23:41:55 +03:00
nla_total_size ( sizeof ( u8 ) ) + /* TCP_NLA_TTL */
2022-10-26 16:51:14 +03:00
nla_total_size ( sizeof ( u32 ) ) + /* TCP_NLA_REHASH */
2018-08-01 03:46:20 +03:00
0 ;
}
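The size computed by tcp_opt_stats_get_size() is a sum of per-attribute netlink sizes. A rough sketch of that arithmetic is given below, assuming a 4-byte attribute header and 4-byte alignment. The sketch_ names are hypothetical, and the needs_pad flag is an assumption that models kernels which reserve room for an extra empty padding attribute in front of 64-bit payloads.

```c
#include <stddef.h>

#define SKETCH_NLA_ALIGNTO ((size_t)4)
#define SKETCH_NLA_ALIGN(len) \
	(((len) + SKETCH_NLA_ALIGNTO - 1) & ~(SKETCH_NLA_ALIGNTO - 1))
/* Header: 16-bit length followed by 16-bit type. */
#define SKETCH_NLA_HDRLEN ((size_t)4)

/* Header plus payload, rounded up to the alignment boundary. */
static size_t sketch_nla_total_size(size_t payload)
{
	return SKETCH_NLA_ALIGN(SKETCH_NLA_HDRLEN + payload);
}

/* 64-bit payloads may need an empty padding attribute in front of them
 * so that the payload itself lands on an 8-byte boundary.
 */
static size_t sketch_nla_total_size_64bit(size_t payload, int needs_pad)
{
	size_t size = sketch_nla_total_size(payload);

	if (needs_pad)
		size += sketch_nla_total_size(0);
	return size;
}
```

Under these assumptions a u8, u16 or u32 attribute each occupies 8 bytes, and a u64 attribute occupies 12 or 16 bytes depending on whether padding is reserved.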
2021-01-20 23:41:55 +03:00
/* Returns TTL or hop limit of an incoming packet from skb. */
static u8 tcp_skb_ttl_or_hop_limit ( const struct sk_buff * skb )
{
if ( skb - > protocol = = htons ( ETH_P_IP ) )
return ip_hdr ( skb ) - > ttl ;
else if ( skb - > protocol = = htons ( ETH_P_IPV6 ) )
return ipv6_hdr ( skb ) - > hop_limit ;
else
return 0 ;
}
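The same lookup can be sketched against a raw IP header in userspace. Unlike the kernel helper, which dispatches on skb->protocol, this hypothetical sketch_ttl_or_hop_limit() inspects the version nibble of the first header byte; it assumes hdr points at the start of the IP header.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the IPv4 TTL or IPv6 hop limit, or 0 for anything else. */
static uint8_t sketch_ttl_or_hop_limit(const uint8_t *hdr, size_t len)
{
	if (len == 0)
		return 0;

	switch (hdr[0] >> 4) {
	case 4:
		/* IPv4: TTL is the 9th byte of the header. */
		return len > 8 ? hdr[8] : 0;
	case 6:
		/* IPv6: hop limit is the 8th byte of the header. */
		return len > 7 ? hdr[7] : 0;
	default:
		return 0;
	}
}
```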
2020-07-31 01:44:40 +03:00
struct sk_buff * tcp_get_timestamping_opt_stats ( const struct sock * sk ,
2021-01-20 23:41:55 +03:00
const struct sk_buff * orig_skb ,
const struct sk_buff * ack_skb )
2016-11-28 10:07:18 +03:00
{
const struct tcp_sock * tp = tcp_sk ( sk ) ;
struct sk_buff * stats ;
struct tcp_info info ;
2018-10-15 19:37:53 +03:00
unsigned long rate ;
2017-07-28 20:28:21 +03:00
u64 rate64 ;
2016-11-28 10:07:18 +03:00
2018-08-01 03:46:20 +03:00
stats = alloc_skb ( tcp_opt_stats_get_size ( ) , GFP_ATOMIC ) ;
2016-11-28 10:07:18 +03:00
if ( ! stats )
return NULL ;
tcp_get_info_chrono_stats ( tp , & info ) ;
nla_put_u64_64bit ( stats , TCP_NLA_BUSY ,
info . tcpi_busy_time , TCP_NLA_PAD ) ;
nla_put_u64_64bit ( stats , TCP_NLA_RWND_LIMITED ,
info . tcpi_rwnd_limited , TCP_NLA_PAD ) ;
nla_put_u64_64bit ( stats , TCP_NLA_SNDBUF_LIMITED ,
info . tcpi_sndbuf_limited , TCP_NLA_PAD ) ;
2017-01-28 03:24:38 +03:00
nla_put_u64_64bit ( stats , TCP_NLA_DATA_SEGS_OUT ,
tp - > data_segs_out , TCP_NLA_PAD ) ;
nla_put_u64_64bit ( stats , TCP_NLA_TOTAL_RETRANS ,
tp - > total_retrans , TCP_NLA_PAD ) ;
2017-07-28 20:28:21 +03:00
rate = READ_ONCE ( sk - > sk_pacing_rate ) ;
2018-10-15 19:37:53 +03:00
rate64 = ( rate ! = ~ 0UL ) ? rate : ~ 0ULL ;
2017-07-28 20:28:21 +03:00
nla_put_u64_64bit ( stats , TCP_NLA_PACING_RATE , rate64 , TCP_NLA_PAD ) ;
rate64 = tcp_compute_delivery_rate ( tp ) ;
nla_put_u64_64bit ( stats , TCP_NLA_DELIVERY_RATE , rate64 , TCP_NLA_PAD ) ;
2022-04-06 02:35:38 +03:00
nla_put_u32 ( stats , TCP_NLA_SND_CWND , tcp_snd_cwnd ( tp ) ) ;
2017-07-28 20:28:21 +03:00
nla_put_u32 ( stats , TCP_NLA_REORDERING , tp - > reordering ) ;
nla_put_u32 ( stats , TCP_NLA_MIN_RTT , tcp_min_rtt ( tp ) ) ;
nla_put_u8 ( stats , TCP_NLA_RECUR_RETRANS , inet_csk ( sk ) - > icsk_retransmits ) ;
nla_put_u8 ( stats , TCP_NLA_DELIVERY_RATE_APP_LMT , ! ! tp - > rate_app_limited ) ;
2018-03-16 20:51:07 +03:00
nla_put_u32 ( stats , TCP_NLA_SND_SSTHRESH , tp - > snd_ssthresh ) ;
2018-04-18 09:18:49 +03:00
nla_put_u32 ( stats , TCP_NLA_DELIVERED , tp - > delivered ) ;
nla_put_u32 ( stats , TCP_NLA_DELIVERED_CE , tp - > delivered_ce ) ;
2018-03-04 21:38:35 +03:00
nla_put_u32 ( stats , TCP_NLA_SNDQ_SIZE , tp - > write_seq - tp - > snd_una ) ;
2018-03-04 21:38:36 +03:00
nla_put_u8 ( stats , TCP_NLA_CA_STATE , inet_csk ( sk ) - > icsk_ca_state ) ;
2018-04-18 09:18:49 +03:00
2018-08-01 03:46:21 +03:00
nla_put_u64_64bit ( stats , TCP_NLA_BYTES_SENT , tp - > bytes_sent ,
TCP_NLA_PAD ) ;
2018-08-01 03:46:22 +03:00
nla_put_u64_64bit ( stats , TCP_NLA_BYTES_RETRANS , tp - > bytes_retrans ,
TCP_NLA_PAD ) ;
2018-08-01 03:46:23 +03:00
nla_put_u32 ( stats , TCP_NLA_DSACK_DUPS , tp - > dsack_dups ) ;
2018-08-01 03:46:24 +03:00
nla_put_u32 ( stats , TCP_NLA_REORD_SEEN , tp - > reord_seen ) ;
2018-11-16 03:44:12 +03:00
nla_put_u32 ( stats , TCP_NLA_SRTT , tp - > srtt_us > > 3 ) ;
2020-01-25 00:34:02 +03:00
nla_put_u16 ( stats , TCP_NLA_TIMEOUT_REHASH , tp - > timeout_rehash ) ;
2020-03-09 23:16:40 +03:00
nla_put_u32 ( stats , TCP_NLA_BYTES_NOTSENT ,
max_t ( int , 0 , tp - > write_seq - tp - > snd_nxt ) ) ;
2020-07-31 01:44:40 +03:00
nla_put_u64_64bit ( stats , TCP_NLA_EDT , orig_skb - > skb_mstamp_ns ,
TCP_NLA_PAD ) ;
2021-01-20 23:41:55 +03:00
if ( ack_skb )
nla_put_u8 ( stats , TCP_NLA_TTL ,
tcp_skb_ttl_or_hop_limit ( ack_skb ) ) ;
2018-08-01 03:46:21 +03:00
2022-10-26 16:51:14 +03:00
nla_put_u32 ( stats , TCP_NLA_REHASH , tp - > plb_rehash + tp - > timeout_rehash ) ;
2016-11-28 10:07:18 +03:00
return stats ;
}
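The buffer built by tcp_get_timestamping_opt_stats() is a flat sequence of netlink attributes. A receiver that has obtained it (for example through the SCM_TIMESTAMPING_OPT_STATS control message when SOF_TIMESTAMPING_OPT_STATS is enabled) could walk it roughly as sketched below. The sketch_ names are hypothetical, values are assumed to be in host byte order, and nested or flagged attribute types are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute header: total length (header + payload) and type. */
struct sketch_nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

#define SKETCH_ALIGN4(n) (((n) + (size_t)3) & ~(size_t)3)

/* Find the first attribute of the given type with a 4-byte payload.
 * Returns 1 and stores the value in *out if found, 0 otherwise.
 */
static int sketch_find_u32(const uint8_t *buf, size_t len, uint16_t type,
			   uint32_t *out)
{
	size_t off = 0;

	while (off <= len && len - off >= sizeof(struct sketch_nlattr)) {
		struct sketch_nlattr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		/* Reject attributes that are truncated or overrun the buffer. */
		if (hdr.nla_len < sizeof(hdr) || hdr.nla_len > len - off)
			return 0;
		if (hdr.nla_type == type &&
		    hdr.nla_len == sizeof(hdr) + sizeof(*out)) {
			memcpy(out, buf + off + sizeof(hdr), sizeof(*out));
			return 1;
		}
		/* Each attribute is padded to a 4-byte boundary. */
		off += SKETCH_ALIGN4((size_t)hdr.nla_len);
	}
	return 0;
}
```

Matching on both type and length keeps the walker from misreading a u8 or u64 attribute as a u32 if the type numbering ever differs from what the caller expects.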
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
This patch changes bpf_getsockopt(SOL_TCP) to reuse
do_tcp_getsockopt(). It removes the duplicated code from
bpf_getsockopt(SOL_TCP).
Before this patch, there were some optnames available to
bpf_setsockopt(SOL_TCP) but missing in bpf_getsockopt(SOL_TCP).
For example, TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL,
and a few more. It surprises users from time to time. This patch
automatically closes this gap without duplicating more code.
bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn,
so it stays in sol_tcp_sockopt().
For string name value like TCP_CONGESTION, bpf expects it
is always null terminated, so sol_tcp_sockopt() decrements
optlen by one before calling do_tcp_getsockopt() and
the 'if (optlen < saved_optlen) memset(..,0,..);'
in __bpf_getsockopt() will always do a null termination.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 03:29:18 +03:00
int do_tcp_getsockopt ( struct sock * sk , int level ,
int optname , sockptr_t optval , sockptr_t optlen )
2005-04-17 02:20:36 +04:00
{
2005-08-10 07:11:56 +04:00
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
2005-04-17 02:20:36 +04:00
struct tcp_sock * tp = tcp_sk ( sk ) ;
2016-02-03 10:46:49 +03:00
struct net * net = sock_net ( sk ) ;
2005-04-17 02:20:36 +04:00
int val , len ;
2022-09-02 03:28:15 +03:00
if ( copy_from_sockptr ( & len , optlen , sizeof ( int ) ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
if ( len < 0 )
return - EINVAL ;
len = min_t ( unsigned int , len , sizeof ( int ) ) ;
switch ( optname ) {
case TCP_MAXSEG :
2005-07-06 02:24:38 +04:00
val = tp - > mss_cache ;
2023-05-27 07:03:17 +03:00
if ( tp - > rx_opt . user_mss & &
( ( 1 < < sk - > sk_state ) & ( TCPF_CLOSE | TCPF_LISTEN ) ) )
2005-04-17 02:20:36 +04:00
val = tp - > rx_opt . user_mss ;
2012-04-19 07:41:32 +04:00
if ( tp - > repair )
val = tp - > rx_opt . mss_clamp ;
2005-04-17 02:20:36 +04:00
break ;
case TCP_NODELAY :
val = ! ! ( tp - > nonagle & TCP_NAGLE_OFF ) ;
break ;
case TCP_CORK :
val = ! ! ( tp - > nonagle & TCP_NAGLE_CORK ) ;
break ;
case TCP_KEEPIDLE :
2009-08-29 10:48:54 +04:00
val = keepalive_time_when ( tp ) / HZ ;
2005-04-17 02:20:36 +04:00
break ;
case TCP_KEEPINTVL :
2009-08-29 10:48:54 +04:00
val = keepalive_intvl_when ( tp ) / HZ ;
2005-04-17 02:20:36 +04:00
break ;
case TCP_KEEPCNT :
2009-08-29 10:48:54 +04:00
val = keepalive_probes ( tp ) ;
2005-04-17 02:20:36 +04:00
break ;
case TCP_SYNCNT :
2023-07-20 00:28:52 +03:00
val = READ_ONCE ( icsk - > icsk_syn_retries ) ? :
2022-07-15 20:17:46 +03:00
READ_ONCE ( net - > ipv4 . sysctl_tcp_syn_retries ) ;
2005-04-17 02:20:36 +04:00
break ;
case TCP_LINGER2 :
2023-07-20 00:28:53 +03:00
val = READ_ONCE ( tp - > linger2 ) ;
2005-04-17 02:20:36 +04:00
if ( val > = 0 )
2022-07-15 20:17:50 +03:00
val = ( val ? : READ_ONCE ( net - > ipv4 . sysctl_tcp_fin_timeout ) ) / HZ ;
2005-04-17 02:20:36 +04:00
break ;
case TCP_DEFER_ACCEPT :
2023-07-20 00:28:54 +03:00
val = READ_ONCE ( icsk - > icsk_accept_queue . rskq_defer_accept ) ;
val = retrans_to_secs ( val , TCP_TIMEOUT_INIT / HZ ,
TCP_RTO_MAX / HZ ) ;
2005-04-17 02:20:36 +04:00
break ;
case TCP_WINDOW_CLAMP :
val = tp - > window_clamp ;
break ;
case TCP_INFO : {
struct tcp_info info ;
2022-09-02 03:28:15 +03:00
if ( copy_from_sockptr ( & len , optlen , sizeof ( int ) ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
tcp_get_info ( sk , & info ) ;
len = min_t ( unsigned int , len , sizeof ( info ) ) ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optlen , & len , sizeof ( int ) ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optval , & info , len ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
return 0 ;
}
2015-04-29 02:23:49 +03:00
case TCP_CC_INFO : {
const struct tcp_congestion_ops * ca_ops ;
union tcp_cc_info info ;
size_t sz = 0 ;
int attr ;
2022-09-02 03:28:15 +03:00
if ( copy_from_sockptr ( & len , optlen , sizeof ( int ) ) )
2015-04-29 02:23:49 +03:00
return - EFAULT ;
ca_ops = icsk - > icsk_ca_ops ;
if ( ca_ops & & ca_ops - > get_info )
sz = ca_ops - > get_info ( sk , ~ 0U , & attr , & info ) ;
len = min_t ( unsigned int , len , sz ) ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optlen , & len , sizeof ( int ) ) )
2015-04-29 02:23:49 +03:00
return - EFAULT ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optval , & info , len ) )
2015-04-29 02:23:49 +03:00
return - EFAULT ;
return 0 ;
}
2005-04-17 02:20:36 +04:00
case TCP_QUICKACK :
2019-01-25 21:53:19 +03:00
val = ! inet_csk_in_pingpong_mode ( sk ) ;
2005-04-17 02:20:36 +04:00
break ;
2005-06-24 07:37:36 +04:00
case TCP_CONGESTION :
2022-09-02 03:28:15 +03:00
if ( copy_from_sockptr ( & len , optlen , sizeof ( int ) ) )
2005-06-24 07:37:36 +04:00
return - EFAULT ;
len = min_t ( unsigned int , len , TCP_CA_NAME_MAX ) ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optlen , & len , sizeof ( int ) ) )
2005-06-24 07:37:36 +04:00
return - EFAULT ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optval , icsk - > icsk_ca_ops - > name , len ) )
2005-06-24 07:37:36 +04:00
return - EFAULT ;
return 0 ;
2009-12-02 21:19:30 +03:00
2017-06-14 21:37:14 +03:00
case TCP_ULP :
2022-09-02 03:28:15 +03:00
if ( copy_from_sockptr ( & len , optlen , sizeof ( int ) ) )
2017-06-14 21:37:14 +03:00
return - EFAULT ;
len = min_t ( unsigned int , len , TCP_ULP_NAME_MAX ) ;
2017-06-26 18:36:47 +03:00
if ( ! icsk - > icsk_ulp_ops ) {
2022-09-02 03:28:15 +03:00
len = 0 ;
if ( copy_to_sockptr ( optlen , & len , sizeof ( int ) ) )
2017-06-26 18:36:47 +03:00
return - EFAULT ;
return 0 ;
}
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optlen , & len , sizeof ( int ) ) )
2017-06-14 21:37:14 +03:00
return - EFAULT ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optval , icsk - > icsk_ulp_ops - > name , len ) )
2017-06-14 21:37:14 +03:00
return - EFAULT ;
return 0 ;
2017-10-18 21:22:51 +03:00
case TCP_FASTOPEN_KEY : {
2020-08-10 20:38:39 +03:00
u64 key [ TCP_FASTOPEN_KEY_BUF_LENGTH / sizeof ( u64 ) ] ;
unsigned int key_len ;
2017-10-18 21:22:51 +03:00
2022-09-02 03:28:15 +03:00
if ( copy_from_sockptr ( & len , optlen , sizeof ( int ) ) )
2017-10-18 21:22:51 +03:00
return - EFAULT ;
2020-08-10 20:38:39 +03:00
key_len = tcp_fastopen_get_cipher ( net , icsk , key ) *
TCP_FASTOPEN_KEY_LENGTH ;
2019-05-29 19:33:58 +03:00
len = min_t ( unsigned int , len , key_len ) ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optlen , & len , sizeof ( int ) ) )
2017-10-18 21:22:51 +03:00
return - EFAULT ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optval , key , len ) )
2017-10-18 21:22:51 +03:00
return - EFAULT ;
return 0 ;
}
2010-07-30 17:49:35 +04:00
case TCP_THIN_LINEAR_TIMEOUTS :
val = tp - > thin_lto ;
break ;
2017-01-13 09:11:41 +03:00
2010-07-30 17:49:35 +04:00
case TCP_THIN_DUPACK :
2017-01-13 09:11:41 +03:00
val = 0 ;
2010-07-30 17:49:35 +04:00
break ;
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides a "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
up to 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in any way when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
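From an application's point of view, the option described above is set and read back like any other TCP-level socket option. The sketch below assumes a Linux system whose <netinet/tcp.h> defines TCP_USER_TIMEOUT; the sketch_ helper names are hypothetical.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Set the user timeout in milliseconds; 0 restores the system default.
 * Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static int sketch_set_user_timeout(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}

/* Read the current user timeout in milliseconds.
 * Returns 0 on success, -1 on error (errno set by getsockopt).
 */
static int sketch_get_user_timeout(int fd, unsigned int *timeout_ms)
{
	socklen_t len = sizeof(*timeout_ms);

	return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  timeout_ms, &len);
}
```

Because the option is inherited by accepted sockets, a server can set it once on the listener instead of on every connection.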
2010-08-27 23:13:28 +04:00
2012-04-19 07:40:39 +04:00
case TCP_REPAIR :
val = tp - > repair ;
break ;
case TCP_REPAIR_QUEUE :
if ( tp - > repair )
val = tp - > repair_queue ;
else
return - EINVAL ;
break ;
2016-06-28 01:33:56 +03:00
case TCP_REPAIR_WINDOW : {
struct tcp_repair_window opt ;
2022-09-02 03:28:15 +03:00
if ( copy_from_sockptr ( & len , optlen , sizeof ( int ) ) )
2016-06-28 01:33:56 +03:00
return - EFAULT ;
if ( len ! = sizeof ( opt ) )
return - EINVAL ;
if ( ! tp - > repair )
return - EPERM ;
opt . snd_wl1 = tp - > snd_wl1 ;
opt . snd_wnd = tp - > snd_wnd ;
opt . max_window = tp - > max_window ;
opt . rcv_wnd = tp - > rcv_wnd ;
opt . rcv_wup = tp - > rcv_wup ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optval , & opt , len ) )
2016-06-28 01:33:56 +03:00
return - EFAULT ;
return 0 ;
}
2012-04-19 07:40:39 +04:00
case TCP_QUEUE_SEQ :
if ( tp - > repair_queue = = TCP_SEND_QUEUE )
val = tp - > write_seq ;
else if ( tp - > repair_queue = = TCP_RECV_QUEUE )
val = tp - > rcv_nxt ;
else
return - EINVAL ;
break ;
tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeout allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeout
allows applications to "fail fast" if so desired. Otherwise it may take
up to 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be set during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in any way when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
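A minimal user-space sketch of the option described above, assuming a Linux host. The helper name user_timeout_roundtrip() is hypothetical, and the fallback value 18 is the number used in linux/tcp.h, supplied only for older libc headers. It sets the timeout in milliseconds on an unconnected socket and reads it back through getsockopt().

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18	/* value from linux/tcp.h, for older libc headers */
#endif

/*
 * Hypothetical helper: set TCP_USER_TIMEOUT (milliseconds) on a fresh TCP
 * socket and read it back. Returns the value reported by getsockopt(),
 * or -1 if any step fails.
 */
int user_timeout_roundtrip(unsigned int timeout_ms)
{
	unsigned int val = 0;
	socklen_t len = sizeof(val);
	int ret = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	/* The option may be set in any state; it only takes effect once
	 * the connection is in a synchronized state. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
		       &timeout_ms, sizeof(timeout_ms)) == 0 &&
	    getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &val, &len) == 0)
		ret = (int)val;
	close(fd);
	return ret;
}
```

Passing 0 restores the system default behaviour, as the text above states.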
2010-08-27 23:13:28 +04:00
	case TCP_USER_TIMEOUT:
2023-07-20 00:28:56 +03:00
		val = READ_ONCE(icsk->icsk_user_timeout);
2010-08-27 23:13:28 +04:00
		break;
2014-04-16 20:25:01 +04:00
	case TCP_FASTOPEN:
2023-07-20 00:28:57 +03:00
		val = READ_ONCE(icsk->icsk_accept_queue.fastopenq.max_qlen);
2014-04-16 20:25:01 +04:00
		break;
net/tcp-fastopen: Add new API support
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences is not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created, without changing the rest of the socket call
sequence:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return ENOTSUPP if TFO is not supported or not enabled in the
    kernel.
  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.
  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.
  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
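The first two steps of the call sequence above can be sketched in user space as follows. This is an illustration under stated assumptions, not code from the patch: tfo_client_socket() is a hypothetical helper name, and the fallback value 30 is the number used in linux/tcp.h for libc headers that lack the definition.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30	/* value from linux/tcp.h */
#endif

/*
 * Hypothetical helper: create a TCP socket with TCP_FASTOPEN_CONNECT
 * enabled. Returns the descriptor, or -1 if the socket cannot be created
 * or the kernel rejects the option (for example when client-side Fast
 * Open is disabled).
 */
int tfo_client_socket(void)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	/* Must be done before connect(); the rest of the call sequence
	 * (connect(), write(), read()) stays as it was. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
		       &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The caller then uses connect() and write() on the returned descriptor exactly as before; no MSG_FASTOPEN flag is involved.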
2017-01-23 21:59:22 +03:00
	case TCP_FASTOPEN_CONNECT:
		val = tp->fastopen_connect;
		break;
2017-10-23 23:22:23 +03:00
	case TCP_FASTOPEN_NO_COOKIE:
		val = tp->fastopen_no_cookie;
		break;
tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.
Linux offers netem module, but it has impractical constraints :
- Need root access to change qdisc
- Hard to set up on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.
EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.
Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.
This requires FQ packet scheduler or an EDT-enabled NIC.
This patch adds TCP_TX_DELAY socket option, to set a delay in
usec units.
  unsigned int tx_delay = 10000; /* 10 msec */
  setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
Note that FQ packet scheduler limits might need some tweaking :
man tc-fq
PARAMETERS
  limit
    Hard limit on the real queue size. When this limit is
    reached, new packets are dropped. If the value is lowered,
    packets are dropped so that the new limit is met. Default
    is 10000 packets.
  flow_limit
    Hard limit on the maximum number of packets queued per
    flow. Default value is 100.
Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
so packets would be dropped if any of the previous limits is hit.
Use of a jump label makes this support runtime-free, for hosts
never using the option.
Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account that skbs artificially delayed
won't stop us providing more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ cannot use this trick)
Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
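To get a feel for why the FQ limits quoted above may need tweaking, the sketch below gives a rough back-of-the-envelope estimate of how many packets a delayed flow keeps queued. This is not part of the patch: the function names are hypothetical, and the model assumes a steady sending rate and fixed-size packets (it ignores TSO/GSO aggregation, where one skb carries several segments).

```c
#include <stdint.h>

/* Default per-flow limit from the tc-fq excerpt above. */
#define FQ_DEFAULT_FLOW_LIMIT 100

/*
 * Hypothetical estimate: packets held back by a per-socket delay.
 * bytes_in_delay = rate_bps / 8 * delay_usec / 1e6, divided by packet size.
 * The product rate_bps * delay_usec must fit in 64 bits.
 */
uint64_t delayed_packets_estimate(uint64_t rate_bps, uint64_t delay_usec,
				  uint64_t pkt_bytes)
{
	if (pkt_bytes == 0)
		return 0;
	return rate_bps * delay_usec / (8ULL * pkt_bytes * 1000000ULL);
}

/* Returns 1 if the estimate exceeds the default FQ flow_limit. */
int exceeds_default_flow_limit(uint64_t rate_bps, uint64_t delay_usec,
			       uint64_t pkt_bytes)
{
	return delayed_packets_estimate(rate_bps, delay_usec, pkt_bytes) >
	       FQ_DEFAULT_FLOW_LIMIT;
}
```

Under these assumptions, a 10 msec delay at 100 Mbit/s with 1500-byte packets stays under the default flow_limit, while the same delay at 1 Gbit/s does not.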
2019-06-12 21:57:25 +03:00
	case TCP_TX_DELAY:
2023-07-20 00:28:47 +03:00
		val = READ_ONCE(tp->tcp_tx_delay);
2019-06-12 21:57:25 +03:00
		break;
2013-02-11 09:50:18 +04:00
	case TCP_TIMESTAMP:
2023-10-20 15:57:47 +03:00
		val = tcp_clock_ts(tp->tcp_usec_ts) + READ_ONCE(tp->tsoffset);
		if (tp->tcp_usec_ts)
			val |= 1;
		else
			val &= ~1;
2013-02-11 09:50:18 +04:00
		break;
tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.
TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :
Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)
For most workloads, a value of 128 KB or less is enough to give
applications time to react to POLLOUT events
(or to be woken up in a blocking sendmsg())
This patch adds two ways to set the limit :
1) Per socket option TCP_NOTSENT_LOWAT
2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
This changes poll()/select()/epoll() to report POLLOUT
only if the number of unsent bytes is below tp->notsent_lowat
Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol)
Tested:
netperf sessions, and watching /proc/net/protocols "memory" column for TCP
With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.
A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
412,514 context-switches
200.034645535 seconds time elapsed
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
2,675,818 context-switches
200.029651391 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
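The per-socket variant of the limit can be sketched from user space as below, assuming a Linux host. The helper names set_notsent_lowat() and get_notsent_lowat() are hypothetical, and the fallback value 25 is the number used in linux/tcp.h for libc headers that lack the definition.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25	/* value from linux/tcp.h */
#endif

/*
 * Hypothetical helper: cap the number of unsent bytes that may sit in the
 * write queue before poll()/select()/epoll() stop reporting POLLOUT.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()).
 */
int set_notsent_lowat(int fd, unsigned int bytes)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}

/*
 * Hypothetical helper: read the per-socket value back.
 * Returns the value, or -1 on error.
 */
long get_notsent_lowat(int fd)
{
	unsigned int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &val, &len) < 0)
		return -1;
	return (long)val;
}
```

A value of 131072 matches the 128 KB setting used in the measurements above; a socket left at zero falls back to the sysctl.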
2013-07-23 07:27:07 +04:00
	case TCP_NOTSENT_LOWAT:
2023-07-20 00:28:55 +03:00
		val = READ_ONCE(tp->notsent_lowat);
2013-07-23 07:27:07 +04:00
		break;
tcp: send in-queue bytes in cmsg upon read
Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer sizes to drain sockets. Knowing
the receive queue length, applications can optimize
how they allocate buffers to read from the socket.
The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.
Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.
Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.
With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
V3 change-log:
As suggested by David Miller, added loads with barrier
to check whether we have multiple threads calling recvmsg
in parallel. When that happens we lock the socket to
calculate inq.
V4 change-log:
Removed inline from a static function.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
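A minimal user-space sketch of how an application might consume the control message described above, assuming a Linux host. The helper names enable_inq() and recv_with_inq() are hypothetical; the fallback value 36 is the number used in linux/tcp.h, where TCP_CM_INQ is defined as TCP_INQ.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_INQ
#define TCP_INQ 36		/* value from linux/tcp.h */
#endif
#ifndef TCP_CM_INQ
#define TCP_CM_INQ TCP_INQ
#endif

/* Hypothetical helper: ask the kernel to attach TCP_CM_INQ to recvmsg(). */
int enable_inq(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
}

/*
 * Hypothetical helper: read up to len bytes and report the number of bytes
 * still queued via *inq. *inq is -1 when no TCP_CM_INQ message was received.
 * Returns what recvmsg() returns.
 */
ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
{
	union {			/* keeps the control buffer aligned */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctrl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;
	struct cmsghdr *cm;
	ssize_t n;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);
	*inq = -1;

	n = recvmsg(fd, &msg, 0);
	if (n < 0)
		return n;
	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == IPPROTO_TCP &&
		    cm->cmsg_type == TCP_CM_INQ)
			memcpy(inq, CMSG_DATA(cm), sizeof(*inq));
	}
	return n;
}
```

An application could start with a small buffer and size the next read from *inq, keeping in mind that the value is only a hint.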
2018-05-01 22:39:15 +03:00
	case TCP_INQ:
		val = tp->recvmsg_inq;
		break;
2015-05-04 07:34:46 +03:00
	case TCP_SAVE_SYN:
		val = tp->save_syn;
		break;
	case TCP_SAVED_SYN: {
2022-09-02 03:28:15 +03:00
		if (copy_from_sockptr(&len, optlen, sizeof(int)))
2015-05-04 07:34:46 +03:00
			return -EFAULT;
2022-09-02 03:28:21 +03:00
		sockopt_lock_sock(sk);
2015-05-04 07:34:46 +03:00
		if (tp->saved_syn) {
2020-08-20 22:00:14 +03:00
			if (len < tcp_saved_syn_len(tp->saved_syn)) {
2022-09-02 03:28:15 +03:00
				len = tcp_saved_syn_len(tp->saved_syn);
				if (copy_to_sockptr(optlen, &len, sizeof(int))) {
2022-09-02 03:28:21 +03:00
					sockopt_release_sock(sk);
2015-05-18 21:35:58 +03:00
					return -EFAULT;
				}
2022-09-02 03:28:21 +03:00
				sockopt_release_sock(sk);
2015-05-18 21:35:58 +03:00
				return -EINVAL;
			}
2020-08-20 22:00:14 +03:00
			len = tcp_saved_syn_len(tp->saved_syn);
2022-09-02 03:28:15 +03:00
			if (copy_to_sockptr(optlen, &len, sizeof(int))) {
2022-09-02 03:28:21 +03:00
				sockopt_release_sock(sk);
2015-05-04 07:34:46 +03:00
				return -EFAULT;
			}
2022-09-02 03:28:15 +03:00
			if (copy_to_sockptr(optval, tp->saved_syn->data, len)) {
2022-09-02 03:28:21 +03:00
				sockopt_release_sock(sk);
2015-05-04 07:34:46 +03:00
				return -EFAULT;
			}
			tcp_saved_syn_free(tp);
2022-09-02 03:28:21 +03:00
			sockopt_release_sock(sk);
2015-05-04 07:34:46 +03:00
		} else {
2022-09-02 03:28:21 +03:00
			sockopt_release_sock(sk);
2015-05-04 07:34:46 +03:00
			len = 0;
2022-09-02 03:28:15 +03:00
			if (copy_to_sockptr(optlen, &len, sizeof(int)))
2015-05-04 07:34:46 +03:00
				return -EFAULT;
		}
		return 0;
	}
tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.
Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.
1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
This operation does not involve any TCP locking.
2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
the transfer of pages from skbs to one VMA.
This operation only uses down_read(&current->mm->mmap_sem) after
holding TCP lock, thus solving the lockdep issue.
This new implementation was suggested by Andy Lutomirski in great detail.
Benefits are :
- Better scalability, in case multiple threads reuse VMAS
  (without mmap()/munmap() calls) since mmap_sem won't be write locked.
- Better error recovery.
  The previous mmap() model had to provide the expected size of the
  mapping. If for some reason one part could not be mapped (partial MSS),
  the whole operation had to be aborted.
  With the tcp_zerocopy_receive struct, kernel can report how
  many bytes were successfully mapped, and how many bytes should
  be read to skip the problematic sequence.
- No more memory allocation to hold an array of page pointers.
  16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
- skbs are freed while mmap_sem has been released
Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)
Note that memcg might require additional changes.
Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
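The "16 MB mappings needed 32 KB" figure in the benefits list above can be checked with a short calculation. The sketch assumes 4 KB pages and 8-byte pointers, which the commit message does not state explicitly; page_array_bytes() is a hypothetical helper name.

```c
#include <stddef.h>

/*
 * Hypothetical helper: size in bytes of an array holding one pointer per
 * page of a mapping. The page count is rounded up for partial pages.
 */
size_t page_array_bytes(size_t mapping_bytes, size_t page_size,
			size_t ptr_size)
{
	size_t pages;

	if (page_size == 0)
		return 0;
	pages = (mapping_bytes + page_size - 1) / page_size;
	return pages * ptr_size;
}
```

With a 16 MB mapping, 4096-byte pages and 8-byte pointers this gives 4096 pages and a 32768-byte array, which matches the figure quoted above.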
2018-04-27 18:58:08 +03:00
#ifdef CONFIG_MMU
	case TCP_ZEROCOPY_RECEIVE: {
2021-01-21 03:41:48 +03:00
		struct scm_timestamping_internal tss;
2020-12-10 22:16:03 +03:00
		struct tcp_zerocopy_receive zc = {};
2018-04-27 18:58:08 +03:00
		int err;
2022-09-02 03:28:15 +03:00
		if (copy_from_sockptr(&len, optlen, sizeof(int)))
2018-04-27 18:58:08 +03:00
			return -EFAULT;
2021-02-26 02:26:28 +03:00
		if (len < 0 ||
		    len < offsetofend(struct tcp_zerocopy_receive, length))
2018-04-27 18:58:08 +03:00
			return -EINVAL;
2021-02-12 00:21:07 +03:00
		if (unlikely(len > sizeof(zc))) {
2022-09-02 03:28:15 +03:00
			err = check_zeroed_sockptr(optval, sizeof(zc),
						   len - sizeof(zc));
2021-02-12 00:21:07 +03:00
			if (err < 1)
				return err == 0 ? -EINVAL : err;
2020-02-15 02:30:49 +03:00
			len = sizeof(zc);
2022-09-02 03:28:15 +03:00
			if (copy_to_sockptr(optlen, &len, sizeof(int)))
2020-02-25 23:38:54 +03:00
				return -EFAULT;
		}
2022-09-02 03:28:15 +03:00
		if (copy_from_sockptr(&zc, optval, len))
2018-04-27 18:58:08 +03:00
			return -EFAULT;
2021-02-12 00:21:07 +03:00
		if (zc.reserved)
			return -EINVAL;
		if (zc.msg_flags & ~(TCP_VALID_ZC_MSG_FLAGS))
			return -EINVAL;
2022-09-02 03:28:21 +03:00
		sockopt_lock_sock(sk);
2021-01-21 03:41:48 +03:00
		err = tcp_zerocopy_receive(sk, &zc, &tss);
2021-01-15 19:34:59 +03:00
		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
							  &zc, &len, err);
2022-09-02 03:28:21 +03:00
		sockopt_release_sock(sk);
2021-01-21 03:41:48 +03:00
		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
			goto zerocopy_rcv_cmsg;
2020-02-15 02:30:49 +03:00
		switch (len) {
2021-01-21 03:41:48 +03:00
		case offsetofend(struct tcp_zerocopy_receive, msg_flags):
			goto zerocopy_rcv_cmsg;
		case offsetofend(struct tcp_zerocopy_receive, msg_controllen):
		case offsetofend(struct tcp_zerocopy_receive, msg_control):
		case offsetofend(struct tcp_zerocopy_receive, flags):
		case offsetofend(struct tcp_zerocopy_receive, copybuf_len):
		case offsetofend(struct tcp_zerocopy_receive, copybuf_address):
2020-02-15 02:30:50 +03:00
		case offsetofend(struct tcp_zerocopy_receive, err):
			goto zerocopy_rcv_sk_err;
2020-02-15 02:30:49 +03:00
		case offsetofend(struct tcp_zerocopy_receive, inq):
			goto zerocopy_rcv_inq;
		case offsetofend(struct tcp_zerocopy_receive, length):
		default:
			goto zerocopy_rcv_out;
		}
2021-01-21 03:41:48 +03:00
zerocopy_rcv_cmsg:
		if (zc.msg_flags & TCP_CMSG_TS)
			tcp_zc_finalize_rx_tstamp(sk, &zc, &tss);
		else
			zc.msg_flags = 0;
2020-02-15 02:30:50 +03:00
zerocopy_rcv_sk_err:
		if (!err)
			zc.err = sock_error(sk);
2020-02-15 02:30:49 +03:00
zerocopy_rcv_inq:
		zc.inq = tcp_inq_hint(sk);
zerocopy_rcv_out:
2022-09-02 03:28:15 +03:00
		if (!err && copy_to_sockptr(optval, &zc, len))
2018-04-27 18:58:08 +03:00
err = - EFAULT ;
return err ;
}
# endif
2005-04-17 02:20:36 +04:00
default :
return - ENOPROTOOPT ;
2007-04-21 04:09:22 +04:00
}
2005-04-17 02:20:36 +04:00
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optlen , & len , sizeof ( int ) ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
2022-09-02 03:28:15 +03:00
if ( copy_to_sockptr ( optval , & val , len ) )
2005-04-17 02:20:36 +04:00
return - EFAULT ;
return 0 ;
}
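The TCP_ZEROCOPY_RECEIVE case above uses the caller-supplied length to decide which trailing fields of the struct get filled in, so older binaries built against a shorter struct keep working. A minimal userspace sketch of that idea follows. The struct `zc_sketch`, the enum, and `zc_stage_for_len()` are hypothetical names for illustration only, and the struct is a shortened stand-in rather than the real UAPI definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened stand-in for struct tcp_zerocopy_receive.
 * Only the field order matters for this sketch. */
struct zc_sketch {
	uint64_t address;
	uint32_t length;
	uint32_t recv_skip_hint;
	uint32_t inq;
	int32_t  err;
};

/* Offset of the first byte after MEMBER, as the kernel macro computes it. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Which optional fields would be filled for a given caller length. */
enum zc_stage {
	ZC_FILL_NONE,		/* copy back only what the receive produced */
	ZC_FILL_INQ,		/* also fill inq */
	ZC_FILL_ERR_AND_INQ,	/* fill err, then fall through to inq */
};

static enum zc_stage zc_stage_for_len(size_t len)
{
	/* Exact-match dispatch: unknown sizes take the default branch. */
	switch (len) {
	case offsetofend(struct zc_sketch, err):
		return ZC_FILL_ERR_AND_INQ;
	case offsetofend(struct zc_sketch, inq):
		return ZC_FILL_INQ;
	case offsetofend(struct zc_sketch, length):
	default:
		return ZC_FILL_NONE;
	}
}
```

The labels in the original are ordered so that each stage falls through to the next one, which is why a caller passing the larger length also gets the fields of every smaller layout.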
2021-01-15 19:34:59 +03:00
bool tcp_bpf_bypass_getsockopt ( int level , int optname )
{
/* TCP do_tcp_getsockopt has optimized getsockopt implementation
* to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE .
*/
if ( level = = SOL_TCP & & optname = = TCP_ZEROCOPY_RECEIVE )
return true ;
return false ;
}
EXPORT_SYMBOL ( tcp_bpf_bypass_getsockopt ) ;
2006-03-21 09:45:21 +03:00
int tcp_getsockopt ( struct sock * sk , int level , int optname , char __user * optval ,
int __user * optlen )
{
struct inet_connection_sock * icsk = inet_csk ( sk ) ;
if ( level ! = SOL_TCP )
2022-10-06 21:53:49 +03:00
/* Paired with WRITE_ONCE() in do_ipv6_setsockopt() and tcp_v6_connect() */
return READ_ONCE ( icsk - > icsk_af_ops ) - > getsockopt ( sk , level , optname ,
optval , optlen ) ;
2022-09-02 03:28:15 +03:00
return do_tcp_getsockopt ( sk , level , optname , USER_SOCKPTR ( optval ) ,
USER_SOCKPTR ( optlen ) ) ;
2006-03-21 09:45:21 +03:00
}
2010-07-10 01:22:10 +04:00
EXPORT_SYMBOL ( tcp_getsockopt ) ;
2006-03-21 09:45:21 +03:00
2006-11-15 06:07:45 +03:00
# ifdef CONFIG_TCP_MD5SIG
2014-10-23 23:58:58 +04:00
static DEFINE_PER_CPU ( struct tcp_md5sig_pool , tcp_md5sig_pool ) ;
2013-05-20 10:52:26 +04:00
static DEFINE_MUTEX ( tcp_md5sig_mutex ) ;
2014-10-23 23:58:58 +04:00
static bool tcp_md5sig_pool_populated = false ;
2006-11-15 06:07:45 +03:00
2013-05-20 10:52:26 +04:00
static void __tcp_alloc_md5sig_pool ( void )
2006-11-15 06:07:45 +03:00
{
2016-01-24 16:20:23 +03:00
struct crypto_ahash * hash ;
2006-11-15 06:07:45 +03:00
int cpu ;
2016-01-24 16:20:23 +03:00
hash = crypto_alloc_ahash ( " md5 " , 0 , CRYPTO_ALG_ASYNC ) ;
2016-03-17 21:22:54 +03:00
if ( IS_ERR ( hash ) )
2016-01-24 16:20:23 +03:00
return ;
2006-11-15 06:07:45 +03:00
for_each_possible_cpu ( cpu ) {
2016-06-27 19:51:53 +03:00
void * scratch = per_cpu ( tcp_md5sig_pool , cpu ) . scratch ;
2016-01-24 16:20:23 +03:00
struct ahash_request * req ;
2006-11-15 06:07:45 +03:00
2016-06-27 19:51:53 +03:00
if ( ! scratch ) {
scratch = kmalloc_node ( sizeof ( union tcp_md5sum_block ) +
sizeof ( struct tcphdr ) ,
GFP_KERNEL ,
cpu_to_node ( cpu ) ) ;
if ( ! scratch )
return ;
per_cpu ( tcp_md5sig_pool , cpu ) . scratch = scratch ;
}
2016-01-24 16:20:23 +03:00
if ( per_cpu ( tcp_md5sig_pool , cpu ) . md5_req )
continue ;
req = ahash_request_alloc ( hash , GFP_KERNEL ) ;
if ( ! req )
return ;
ahash_request_set_callback ( req , 0 , NULL , NULL ) ;
per_cpu ( tcp_md5sig_pool , cpu ) . md5_req = req ;
2006-11-15 06:07:45 +03:00
}
2014-10-23 23:58:58 +04:00
/* before setting tcp_md5sig_pool_populated, we must commit all writes
* to memory . See smp_rmb ( ) in tcp_get_md5sig_pool ( )
2013-05-20 10:52:26 +04:00
*/
smp_wmb ( ) ;
2022-08-23 00:15:28 +03:00
/* Paired with READ_ONCE() from tcp_alloc_md5sig_pool()
* and tcp_get_md5sig_pool ( ) .
*/
WRITE_ONCE ( tcp_md5sig_pool_populated , true ) ;
2006-11-15 06:07:45 +03:00
}
2013-05-20 10:52:26 +04:00
bool tcp_alloc_md5sig_pool ( void )
2006-11-15 06:07:45 +03:00
{
2022-08-23 00:15:28 +03:00
/* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
if ( unlikely ( ! READ_ONCE ( tcp_md5sig_pool_populated ) ) ) {
2013-05-20 10:52:26 +04:00
mutex_lock ( & tcp_md5sig_mutex ) ;
2022-11-23 20:38:57 +03:00
if ( ! tcp_md5sig_pool_populated )
2013-05-20 10:52:26 +04:00
__tcp_alloc_md5sig_pool ( ) ;
mutex_unlock ( & tcp_md5sig_mutex ) ;
2006-11-15 06:07:45 +03:00
}
2022-08-23 00:15:28 +03:00
/* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
return READ_ONCE ( tcp_md5sig_pool_populated ) ;
2006-11-15 06:07:45 +03:00
}
EXPORT_SYMBOL ( tcp_alloc_md5sig_pool ) ;
2010-05-16 11:34:04 +04:00
/**
* tcp_get_md5sig_pool - get md5sig_pool for this user
*
* We use percpu structure , so if we succeed , we exit with preemption
* and BH disabled , to make sure another thread or softirq handling
* won't try to get the same context .
*/
struct tcp_md5sig_pool * tcp_get_md5sig_pool ( void )
2006-11-15 06:07:45 +03:00
{
2010-05-16 11:34:04 +04:00
local_bh_disable ( ) ;
2006-11-15 06:07:45 +03:00
2022-08-23 00:15:28 +03:00
/* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
if ( READ_ONCE ( tcp_md5sig_pool_populated ) ) {
2014-10-23 23:58:58 +04:00
/* coupled with smp_wmb() in __tcp_alloc_md5sig_pool() */
smp_rmb ( ) ;
return this_cpu_ptr ( & tcp_md5sig_pool ) ;
}
2010-05-16 11:34:04 +04:00
local_bh_enable ( ) ;
return NULL ;
}
EXPORT_SYMBOL ( tcp_get_md5sig_pool ) ;
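The allocation and lookup helpers above follow a publish-once pattern: the pool is populated under a mutex, all writes are committed before the flag is set, and readers check the flag before touching the pool. A rough userspace analogue using C11 atomics and a pthread mutex is sketched below. All names here (`pool_sketch`, `pool_alloc()`, `pool_get()`) are hypothetical, release/acquire ordering stands in for the smp_wmb()/smp_rmb() pair, and unlike the kernel code the populate step in this sketch cannot fail.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU pool contents. */
struct pool_sketch { int scratch; };

static struct pool_sketch pool;
static atomic_bool pool_populated;	/* zero-initialized: false */
static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;
static int pool_init_calls;		/* counts how often we populated */

/* Caller must hold pool_mutex. */
static void pool_populate_locked(void)
{
	pool.scratch = 42;
	pool_init_calls++;
	/* Release store: the writes above become visible before the flag. */
	atomic_store_explicit(&pool_populated, true, memory_order_release);
}

static bool pool_alloc(void)
{
	/* Fast path: skip the mutex once the pool has been published. */
	if (!atomic_load_explicit(&pool_populated, memory_order_acquire)) {
		pthread_mutex_lock(&pool_mutex);
		/* Re-check under the lock so only one caller populates. */
		if (!atomic_load_explicit(&pool_populated, memory_order_relaxed))
			pool_populate_locked();
		pthread_mutex_unlock(&pool_mutex);
	}
	return atomic_load_explicit(&pool_populated, memory_order_acquire);
}

static struct pool_sketch *pool_get(void)
{
	/* Acquire load pairs with the release store in the populate step. */
	if (atomic_load_explicit(&pool_populated, memory_order_acquire))
		return &pool;
	return NULL;
}
```

The second check under the mutex is what keeps two racing callers from populating the pool twice.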
2006-11-15 06:07:45 +03:00
2008-07-19 11:01:42 +04:00
int tcp_md5_hash_skb_data ( struct tcp_md5sig_pool * hp ,
2011-10-21 13:22:42 +04:00
const struct sk_buff * skb , unsigned int header_len )
2008-07-19 11:01:42 +04:00
{
struct scatterlist sg ;
const struct tcphdr * tp = tcp_hdr ( skb ) ;
2016-01-24 16:20:23 +03:00
struct ahash_request * req = hp - > md5_req ;
2012-04-15 09:58:06 +04:00
unsigned int i ;
const unsigned int head_data_len = skb_headlen ( skb ) > header_len ?
skb_headlen ( skb ) - header_len : 0 ;
2008-07-19 11:01:42 +04:00
const struct skb_shared_info * shi = skb_shinfo ( skb ) ;
2010-05-18 00:40:51 +04:00
struct sk_buff * frag_iter ;
2008-07-19 11:01:42 +04:00
sg_init_table ( & sg , 1 ) ;
sg_set_buf ( & sg , ( ( u8 * ) tp ) + header_len , head_data_len ) ;
2016-01-24 16:20:23 +03:00
ahash_request_set_crypt ( req , & sg , NULL , head_data_len ) ;
if ( crypto_ahash_update ( req ) )
2008-07-19 11:01:42 +04:00
return 1 ;
for ( i = 0 ; i < shi - > nr_frags ; + + i ) {
2019-07-23 06:08:26 +03:00
const skb_frag_t * f = & shi - > frags [ i ] ;
2019-07-30 17:40:33 +03:00
unsigned int offset = skb_frag_off ( f ) ;
2013-05-14 01:25:52 +04:00
struct page * page = skb_frag_page ( f ) + ( offset > > PAGE_SHIFT ) ;
sg_set_page ( & sg , page , skb_frag_size ( f ) ,
offset_in_page ( offset ) ) ;
2016-01-24 16:20:23 +03:00
ahash_request_set_crypt ( req , & sg , NULL , skb_frag_size ( f ) ) ;
if ( crypto_ahash_update ( req ) )
2008-07-19 11:01:42 +04:00
return 1 ;
}
2010-05-18 00:40:51 +04:00
skb_walk_frags ( skb , frag_iter )
if ( tcp_md5_hash_skb_data ( hp , frag_iter , 0 ) )
return 1 ;
2008-07-19 11:01:42 +04:00
return 0 ;
}
EXPORT_SYMBOL ( tcp_md5_hash_skb_data ) ;
2011-10-21 13:22:42 +04:00
int tcp_md5_hash_key ( struct tcp_md5sig_pool * hp , const struct tcp_md5sig_key * key )
2008-07-19 11:01:42 +04:00
{
2020-07-01 21:43:04 +03:00
u8 keylen = READ_ONCE ( key - > keylen ) ; /* paired with WRITE_ONCE() in tcp_md5_do_add */
2008-07-19 11:01:42 +04:00
struct scatterlist sg ;
2020-07-01 02:41:01 +03:00
sg_init_one ( & sg , key - > key , keylen ) ;
ahash_request_set_crypt ( hp - > md5_req , & sg , NULL , keylen ) ;
2020-07-01 21:43:04 +03:00
/* We use data_race() because tcp_md5_do_add() might change key->key under us */
return data_race ( crypto_ahash_update ( hp - > md5_req ) ) ;
2008-07-19 11:01:42 +04:00
}
EXPORT_SYMBOL ( tcp_md5_hash_key ) ;
2022-02-23 20:57:40 +03:00
/* Called with rcu_read_lock() */
2022-03-08 03:44:21 +03:00
enum skb_drop_reason
tcp_inbound_md5_hash ( const struct sock * sk , const struct sk_buff * skb ,
const void * saddr , const void * daddr ,
int family , int dif , int sdif )
2022-02-23 20:57:40 +03:00
{
/*
* This gets called for each TCP segment that arrives
* so we want to be efficient .
* We have 3 drop cases :
* o No MD5 hash and one expected .
* o MD5 hash and we're not expecting one .
* o MD5 hash and it's wrong .
*/
const __u8 * hash_location = NULL ;
struct tcp_md5sig_key * hash_expected ;
const struct tcphdr * th = tcp_hdr ( skb ) ;
2023-03-17 18:55:39 +03:00
const struct tcp_sock * tp = tcp_sk ( sk ) ;
2022-02-23 20:57:40 +03:00
int genhash , l3index ;
u8 newhash [ 16 ] ;
/* sdif set, means packet ingressed via a device
* in an L3 domain and dif is set to the l3mdev
*/
l3index = sdif ? dif : 0 ;
hash_expected = tcp_md5_do_lookup ( sk , l3index , saddr , family ) ;
hash_location = tcp_parse_md5sig_option ( th ) ;
/* We've parsed the options - do we have a hash? */
if ( ! hash_expected & & ! hash_location )
2022-03-08 03:44:21 +03:00
return SKB_NOT_DROPPED_YET ;
2022-02-23 20:57:40 +03:00
if ( hash_expected & & ! hash_location ) {
NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPMD5NOTFOUND ) ;
2022-03-08 03:44:21 +03:00
return SKB_DROP_REASON_TCP_MD5NOTFOUND ;
2022-02-23 20:57:40 +03:00
}
if ( ! hash_expected & & hash_location ) {
NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPMD5UNEXPECTED ) ;
2022-03-08 03:44:21 +03:00
return SKB_DROP_REASON_TCP_MD5UNEXPECTED ;
2022-02-23 20:57:40 +03:00
}
2022-07-26 14:57:43 +03:00
/* Check the signature.
* To support dual stack listeners , we need to handle
* IPv4 - mapped case .
*/
if ( family = = AF_INET )
genhash = tcp_v4_md5_hash_skb ( newhash ,
hash_expected ,
NULL , skb ) ;
else
genhash = tp - > af_specific - > calc_md5_hash ( newhash ,
hash_expected ,
NULL , skb ) ;
2022-02-23 20:57:40 +03:00
if ( genhash | | memcmp ( hash_location , newhash , 16 ) ! = 0 ) {
NET_INC_STATS ( sock_net ( sk ) , LINUX_MIB_TCPMD5FAILURE ) ;
if ( family = = AF_INET ) {
net_info_ratelimited ( " MD5 Hash failed for (%pI4, %d)->(%pI4, %d)%s L3 index %d \n " ,
saddr , ntohs ( th - > source ) ,
daddr , ntohs ( th - > dest ) ,
genhash ? " tcp_v4_calc_md5_hash failed "
: " " , l3index ) ;
} else {
net_info_ratelimited ( " MD5 Hash %s for [%pI6c]:%u->[%pI6c]:%u L3 index %d \n " ,
genhash ? " failed " : " mismatch " ,
saddr , ntohs ( th - > source ) ,
daddr , ntohs ( th - > dest ) , l3index ) ;
}
2022-03-08 03:44:21 +03:00
return SKB_DROP_REASON_TCP_MD5FAILURE ;
2022-02-23 20:57:40 +03:00
}
2022-03-08 03:44:21 +03:00
return SKB_NOT_DROPPED_YET ;
2022-02-23 20:57:40 +03:00
}
EXPORT_SYMBOL ( tcp_inbound_md5_hash ) ;
2006-11-15 06:07:45 +03:00
# endif
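The three drop cases listed in tcp_inbound_md5_hash() amount to a small decision table over two inputs: whether a key is configured for the peer, and whether the segment carries an MD5 option. The sketch below restates that table in plain C. The enum and `md5_verdict_for()` are hypothetical names, and the digest is assumed to have been computed by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts mirroring the accept case and the three drop cases. */
enum md5_verdict { MD5_OK, MD5_NOTFOUND, MD5_UNEXPECTED, MD5_FAILURE };

/*
 * key_configured: a key exists for this peer.
 * received:       16-byte digest from the segment, or NULL if absent.
 * computed:       16-byte digest computed locally (used only if both
 *                 a key and a received digest are present).
 */
static enum md5_verdict md5_verdict_for(bool key_configured,
					const unsigned char *received,
					const unsigned char computed[16])
{
	if (!key_configured && !received)
		return MD5_OK;		/* nothing expected, nothing sent */
	if (key_configured && !received)
		return MD5_NOTFOUND;	/* hash expected but missing */
	if (!key_configured && received)
		return MD5_UNEXPECTED;	/* hash sent but not expected */
	/* Both present: the digests must match byte for byte. */
	return memcmp(received, computed, 16) ? MD5_FAILURE : MD5_OK;
}
```

Checking the two cheap presence cases first means the digest comparison only runs when both sides agree that signing is in use.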
2007-04-21 04:11:46 +04:00
void tcp_done ( struct sock * sk )
{
2019-10-11 06:17:38 +03:00
struct request_sock * req ;
2012-08-31 16:29:12 +04:00
2019-10-14 16:47:57 +03:00
/* We might be called with a new socket, after
* inet_csk_prepare_forced_close ( ) has been called
* so we can not use lockdep_sock_is_held ( sk )
*/
req = rcu_dereference_protected ( tcp_sk ( sk ) - > fastopen_rsk , 1 ) ;
2012-08-31 16:29:12 +04:00
2008-11-03 11:24:34 +03:00
if ( sk - > sk_state = = TCP_SYN_SENT | | sk - > sk_state = = TCP_SYN_RECV )
2016-04-30 00:16:47 +03:00
TCP_INC_STATS ( sock_net ( sk ) , TCP_MIB_ATTEMPTFAILS ) ;
2007-04-21 04:11:46 +04:00
tcp_set_state ( sk , TCP_CLOSE ) ;
tcp_clear_xmit_timers ( sk ) ;
2015-04-03 11:17:27 +03:00
if ( req )
2012-08-31 16:29:12 +04:00
reqsk_fastopen_remove ( sk , req , false ) ;
2007-04-21 04:11:46 +04:00
2023-05-09 23:36:56 +03:00
WRITE_ONCE ( sk - > sk_shutdown , SHUTDOWN_MASK ) ;
2007-04-21 04:11:46 +04:00
if ( ! sock_flag ( sk , SOCK_DEAD ) )
sk - > sk_state_change ( sk ) ;
else
inet_csk_destroy_sock ( sk ) ;
}
EXPORT_SYMBOL_GPL ( tcp_done ) ;
2015-12-16 06:30:05 +03:00
int tcp_abort ( struct sock * sk , int err )
{
2022-06-27 15:10:38 +03:00
int state = inet_sk_state_load ( sk ) ;
2015-12-18 03:14:11 +03:00
2022-06-27 15:10:38 +03:00
if ( state = = TCP_NEW_SYN_RECV ) {
struct request_sock * req = inet_reqsk ( sk ) ;
local_bh_disable ( ) ;
inet_csk_reqsk_queue_drop ( req - > rsk_listener , req ) ;
local_bh_enable ( ) ;
return 0 ;
}
if ( state = = TCP_TIME_WAIT ) {
struct inet_timewait_sock * tw = inet_twsk ( sk ) ;
refcount_inc ( & tw - > tw_refcnt ) ;
local_bh_disable ( ) ;
inet_twsk_deschedule_put ( tw ) ;
local_bh_enable ( ) ;
return 0 ;
2015-12-16 06:30:05 +03:00
}
bpf: Add bpf_sock_destroy kfunc
The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends. The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated. The kfunc also allows terminating sockets that may
or may not be actively sending traffic.
The kfunc can currently be called only from BPF TCP and UDP iterators
where users can filter, and terminate selected sockets. More
specifically, it can only be called from BPF contexts that ensure
socket locking in order to allow synchronous execution of protocol
specific `diag_destroy` handlers. The previous commit that batches UDP
sockets during iteration facilitated a synchronous invocation of the UDP
destroy callback from BPF context by skipping socket locks in
`udp_abort`. TCP iterator already supported batching of sockets being
iterated. To that end, `tracing_iter_filter` callback filter is added so
that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
attach type, and reject other programs.
The kfunc takes a `sock_common` type argument, even though it expects, and
casts it to, a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. Furthermore, as `sock_common` only has a subset of
certain fields of `sock`, casting pointer to the latter type might not
always be safe for certain sockets like request sockets, but these have a
special handling in the diag_destroy handlers.
Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
e.g. getting an sk pointer (possibly even NULL) by following another sk
pointer. The pointer socket argument passed in TCP and UDP iterators is
tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes
are contributed by Martin KaFai Lau <martin.lau@kernel.org>.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-20 01:51:55 +03:00
/* BPF context ensures sock locking. */
if ( ! has_current_bpf_ctx ( ) )
/* Don't race with userspace socket closes such as tcp_close. */
lock_sock ( sk ) ;
2015-12-16 06:30:05 +03:00
2015-12-21 18:03:44 +03:00
if ( sk - > sk_state = = TCP_LISTEN ) {
tcp_set_state ( sk , TCP_CLOSE ) ;
inet_csk_listen_stop ( sk ) ;
}
2015-12-16 06:30:05 +03:00
/* Don't race with BH socket closes such as inet_csk_listen_stop. */
local_bh_disable ( ) ;
bh_lock_sock ( sk ) ;
if ( ! sock_flag ( sk , SOCK_DEAD ) ) {
2023-03-15 23:57:44 +03:00
WRITE_ONCE ( sk - > sk_err , err ) ;
2015-12-16 06:30:05 +03:00
/* This barrier is coupled with smp_rmb() in tcp_poll() */
smp_wmb ( ) ;
2021-06-28 01:48:21 +03:00
sk_error_report ( sk ) ;
2015-12-16 06:30:05 +03:00
if ( tcp_need_reset ( sk - > sk_state ) )
tcp_send_active_reset ( sk , GFP_ATOMIC ) ;
tcp_done ( sk ) ;
}
bh_unlock_sock ( sk ) ;
local_bh_enable ( ) ;
2018-03-07 01:15:12 +03:00
tcp_write_queue_purge ( sk ) ;
2023-05-20 01:51:55 +03:00
if ( ! has_current_bpf_ctx ( ) )
release_sock ( sk ) ;
2015-12-16 06:30:05 +03:00
return 0 ;
}
EXPORT_SYMBOL_GPL ( tcp_abort ) ;
2005-06-24 07:37:36 +04:00
extern struct tcp_congestion_ops tcp_reno ;
2005-04-17 02:20:36 +04:00
static __initdata unsigned long thash_entries ;
static int __init set_thash_entries ( char * str )
{
2012-05-19 18:13:18 +04:00
ssize_t ret ;
2005-04-17 02:20:36 +04:00
if ( ! str )
return 0 ;
2012-05-19 18:13:18 +04:00
ret = kstrtoul ( str , 0 , & thash_entries ) ;
if ( ret )
return 0 ;
2005-04-17 02:20:36 +04:00
return 1 ;
}
__setup ( " thash_entries= " , set_thash_entries ) ;
2014-10-01 20:27:50 +04:00
static void __init tcp_init_mem ( void )
2012-01-30 05:20:17 +04:00
{
2015-05-15 22:39:30 +03:00
unsigned long limit = nr_free_buffer_pages ( ) / 16 ;
2012-01-30 05:20:17 +04:00
limit = max ( limit , 128UL ) ;
2015-05-15 22:39:30 +03:00
sysctl_tcp_mem [ 0 ] = limit / 4 * 3 ; /* 4.68 % */
sysctl_tcp_mem [ 1 ] = limit ; /* 6.25 % */
sysctl_tcp_mem [ 2 ] = sysctl_tcp_mem [ 0 ] * 2 ; /* 9.37 % */
2012-01-30 05:20:17 +04:00
}
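The percentages in the tcp_init_mem() comments follow directly from the arithmetic: the middle threshold is 1/16 of the free buffer pages (6.25 %), the first is three quarters of that, and the third is twice the first. A small standalone sketch of the same computation is shown below. The struct and function names are hypothetical, and the page count is passed in as a parameter instead of being read from nr_free_buffer_pages().

```c
/* Hypothetical container for the three sysctl_tcp_mem thresholds (in pages). */
struct tcp_mem_sketch {
	unsigned long min;	/* sysctl_tcp_mem[0] */
	unsigned long pressure;	/* sysctl_tcp_mem[1] */
	unsigned long max;	/* sysctl_tcp_mem[2] */
};

static struct tcp_mem_sketch tcp_mem_for(unsigned long free_buffer_pages)
{
	struct tcp_mem_sketch m;
	unsigned long limit = free_buffer_pages / 16;	/* 6.25 % */

	if (limit < 128UL)	/* same floor as max(limit, 128UL) */
		limit = 128UL;

	m.min = limit / 4 * 3;	/* 3/4 of 6.25 %, about 4.68 % */
	m.pressure = limit;	/* 6.25 % */
	m.max = m.min * 2;	/* twice the first, about 9.37 % */
	return m;
}
```

For example, with 1048576 free pages (4 GB at a 4 KB page size) the limit is 65536 pages, giving thresholds of 49152, 65536 and 98304 pages.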
2005-04-17 02:20:36 +04:00
void __init tcp_init ( void )
{
2012-05-02 06:28:41 +04:00
int max_rshare , max_wshare , cnt ;
2016-09-20 06:39:12 +03:00
unsigned long limit ;
2012-02-09 00:39:07 +04:00
unsigned int i ;
2005-04-17 02:20:36 +04:00
2019-05-18 03:17:22 +03:00
BUILD_BUG_ON ( TCP_MIN_SND_MSS < = MAX_TCP_OPTION_SPACE ) ;
2016-09-20 06:39:12 +03:00
BUILD_BUG_ON ( sizeof ( struct tcp_skb_cb ) >
2019-12-09 21:31:43 +03:00
sizeof_field ( struct sk_buff , cb ) ) ;
2005-04-17 02:20:36 +04:00
2014-09-08 04:51:29 +04:00
percpu_counter_init ( & tcp_sockets_allocated , 0 , GFP_KERNEL ) ;
2021-10-14 16:41:26 +03:00
timer_setup ( & tcp_orphan_timer , tcp_orphan_update , TIMER_DEFERRABLE ) ;
mod_timer ( & tcp_orphan_timer , jiffies + TCP_ORPHAN_TIMER_PERIOD ) ;
2017-12-01 23:52:32 +03:00
inet_hashinfo2_init ( & tcp_hashinfo , " tcp_listen_portaddr_hash " ,
thash_entries , 21 , /* one slot per 2 MB*/
0 , 64 * 1024 ) ;
2005-08-10 07:07:35 +04:00
tcp_hashinfo . bind_bucket_cachep =
kmem_cache_create ( " tcp_bind_bucket " ,
sizeof ( struct inet_bind_bucket ) , 0 ,
2021-07-19 13:44:37 +03:00
SLAB_HWCACHE_ALIGN | SLAB_PANIC |
SLAB_ACCOUNT ,
NULL ) ;
net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.
This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.
Please note a few things:
* There can be the case where a socket's address changes after it
has been bound. There are two cases where this happens:
1) The case where there is a bind() call on INADDR_ANY (ipv4) or
IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
assign the socket an address when it handles the connect()
2) In inet_sk_reselect_saddr(), which is called when rebuilding the
sk header and a few pre-conditions are met (eg rerouting fails).
In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.
* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.
This brings up a few stipulations:
1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
will always be acquired after the bhash lock and released before the
bhash lock is released.
2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
acquired+released before another bhash2 lock is acquired+released.
* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22 21:10:21 +03:00
tcp_hashinfo . bind2_bucket_cachep =
kmem_cache_create ( " tcp_bind2_bucket " ,
sizeof ( struct inet_bind2_bucket ) , 0 ,
SLAB_HWCACHE_ALIGN | SLAB_PANIC |
SLAB_ACCOUNT ,
NULL ) ;
2005-04-17 02:20:36 +04:00
/* Size and allocate the main established and bind bucket
* hash tables .
*
* The methodology is similar to that of the buffer cache .
*/
2005-08-10 07:07:35 +04:00
tcp_hashinfo . ehash =
2005-04-17 02:20:36 +04:00
alloc_large_system_hash ( " TCP established " ,
2005-08-10 06:59:44 +04:00
sizeof ( struct inet_ehash_bucket ) ,
2005-04-17 02:20:36 +04:00
thash_entries ,
2012-11-30 14:08:52 +04:00
17 , /* one slot per 128 KB of memory */
2006-11-07 10:10:51 +03:00
0 ,
2005-04-17 02:20:36 +04:00
NULL ,
2009-10-09 04:16:19 +04:00
& tcp_hashinfo . ehash_mask ,
2012-05-23 17:33:35 +04:00
0 ,
2007-10-30 10:59:25 +03:00
thash_entries ? 0 : 512 * 1024 ) ;
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same way.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 11:22:02 +04:00
for ( i = 0 ; i < = tcp_hashinfo . ehash_mask ; i + + )
2008-11-17 06:40:17 +03:00
INIT_HLIST_NULLS_HEAD ( & tcp_hashinfo . ehash [ i ] . chain , i ) ;
2013-10-03 11:22:02 +04:00
2007-11-07 13:40:20 +03:00
if ( inet_ehash_locks_alloc ( & tcp_hashinfo ) )
panic ( " TCP: failed to alloc ehash_locks " ) ;
2005-08-10 07:07:35 +04:00
tcp_hashinfo . bhash =
2022-06-15 22:32:13 +03:00
alloc_large_system_hash ( " TCP bind " ,
net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.
This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.
Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:
1) The case where there is a bind() call on INADDR_ANY (ipv4) or
IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
assign the socket an address when it handles the connect()
2) In inet_sk_reselect_saddr(), which is called when rebuilding the
sk header and a few pre-conditions are met (eg rerouting fails).
In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.
* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.
This brings up a few stipulations:
1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
will always be acquired after the bhash lock and released before the
bhash lock is released.
2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
acquired+released before another bhash2 lock is acquired+released.
* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.
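The difference between the two keys can be illustrated with a toy sketch. This is not the kernel's hash: the table size, the XOR mixing, and the names toy_bhash_slot() and toy_bhash2_slot() are all assumptions made for illustration. It only shows that sockets on one port share a bhash slot regardless of address, while sockets on different ports can still land in the same bhash2 slot, which is why bhash2 needs its own lock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table size for the sketch; must be a power of two. */
#define TOY_TABLE_SIZE 8u

/* Toy stand-in for the bhash key: the port only. */
static unsigned int toy_bhash_slot(uint16_t port)
{
	return port & (TOY_TABLE_SIZE - 1);
}

/*
 * Toy stand-in for the bhash2 key: port and IPv4 address combined.
 * A plain XOR is used so the result is easy to check by hand; it is
 * an assumption of this sketch, not the kernel's hash function.
 */
static unsigned int toy_bhash2_slot(uint16_t port, uint32_t addr)
{
	assert((TOY_TABLE_SIZE & (TOY_TABLE_SIZE - 1)) == 0);
	return (addr ^ port) & (TOY_TABLE_SIZE - 1);
}
```

With addresses written in host byte order, port 80 at 10.0.0.1 (0x0A000001) and port 80 at 10.0.0.2 (0x0A000002) share a bhash slot but use different bhash2 slots, whereas port 83 at 10.0.0.2 uses a different bhash slot from port 80 yet the same bhash2 slot as port 80 at 10.0.0.1.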
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22 21:10:21 +03:00
2 * sizeof ( struct inet_bind_hashbucket ) ,
2009-10-09 04:16:19 +04:00
tcp_hashinfo . ehash_mask + 1 ,
2012-11-30 14:08:52 +04:00
17 , /* one slot per 128 KB of memory */
2006-11-07 10:10:51 +03:00
0 ,
2005-08-10 07:07:35 +04:00
& tcp_hashinfo . bhash_size ,
2005-04-17 02:20:36 +04:00
NULL ,
2012-05-23 17:33:35 +04:00
0 ,
2005-04-17 02:20:36 +04:00
64 * 1024 ) ;
2012-02-09 00:39:07 +04:00
tcp_hashinfo . bhash_size = 1U << tcp_hashinfo . bhash_size ;
2022-08-22 21:10:21 +03:00
tcp_hashinfo . bhash2 = tcp_hashinfo . bhash + tcp_hashinfo . bhash_size ;
2005-08-10 07:07:35 +04:00
for ( i = 0 ; i < tcp_hashinfo . bhash_size ; i++ ) {
spin_lock_init ( & tcp_hashinfo . bhash [ i ] . lock ) ;
INIT_HLIST_HEAD ( & tcp_hashinfo . bhash [ i ] . chain ) ;
2022-08-22 21:10:21 +03:00
spin_lock_init ( & tcp_hashinfo . bhash2 [ i ] . lock ) ;
INIT_HLIST_HEAD ( & tcp_hashinfo . bhash2 [ i ] . chain ) ;
2005-04-17 02:20:36 +04:00
}
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3 connection.
thash_entries  sockets  length  Gbps
524288               1       1  50.7
524288            24Mi      48  45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries  sockets  length  Gbps
131072               1       1  50.7
131072             1Mi       8  49.9
131072             2Mi      16  48.9
131072             4Mi      32  47.3
131072             8Mi      64  44.6
131072            16Mi     128  40.6
131072            24Mi     192  36.3
131072            32Mi     256  32.5
131072            40Mi     320  27.0
131072            48Mi     384  25.0
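The "length" column above is consistent with dividing the socket count by the number of ehash entries. A minimal sketch of that arithmetic follows; the helper name avg_chain_length() is hypothetical, and the sketch assumes sockets are spread evenly across buckets.

```c
#include <assert.h>

/*
 * Average hash chain length, assuming sockets are spread evenly over
 * the buckets. Hypothetical helper for illustration only.
 */
static unsigned long avg_chain_length(unsigned long sockets,
				      unsigned long buckets)
{
	assert(buckets != 0);
	return sockets / buckets;
}
```

For example, 24Mi sockets over 524288 entries gives 48, and 48Mi sockets over 131072 entries gives 384, matching the tables.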
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
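The session above suggests two rules: a requested size is rounded up to a power of two (100 becomes 128), and a netns that shares the global ehash reports the global size negated. The sketch below models only those two rules. The function names are hypothetical, it is not the kernel's implementation, and it ignores the fallback to the global ehash on allocation failure.

```c
#include <assert.h>

/* Smallest power of two that is >= n. */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n != 0 && n <= (1u << 31));
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Value a new netns would report for tcp_ehash_entries in this model:
 * negative when it shares the global ehash (child_entries == 0),
 * otherwise the size of its own table.
 */
static long reported_ehash_entries(unsigned int global_entries,
				   unsigned int child_entries)
{
	if (child_entries == 0)
		return -(long)global_entries;
	return (long)round_up_pow2(child_entries);
}
```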
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
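As a worked instance of the two formulas above, the following sketch computes both defaults from an ehash size. The function names are hypothetical; only the formulas come from the text.

```c
#include <assert.h>

/* Default tcp_max_tw_buckets per the formula above: entries / 2. */
static unsigned int default_max_tw_buckets(unsigned int entries)
{
	assert(entries != 0);	/* defaults apply only with a per-netns ehash */
	return entries / 2;
}

/* Default tcp_max_syn_backlog per the formula above: max(128, entries / 128). */
static unsigned int default_max_syn_backlog(unsigned int entries)
{
	unsigned int v = entries / 128;

	assert(entries != 0);
	return v > 128 ? v : 128;
}
```

With 128 entries this gives 64 and 128; with 1Mi entries it gives 524288 and 8192.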
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-08 04:10:22 +03:00
tcp_hashinfo . pernet = false ;
2010-08-26 10:02:17 +04:00
cnt = tcp_hashinfo . ehash_mask + 1 ;
sysctl_tcp_max_orphans = cnt / 2 ;
2005-04-17 02:20:36 +04:00
2013-10-20 03:25:36 +04:00
tcp_init_mem ( ) ;
2012-02-02 04:07:00 +04:00
/* Set per-socket limits to no more than 1/128 the pressure threshold */
2012-04-10 04:56:42 +04:00
limit = nr_free_buffer_pages ( ) << ( PAGE_SHIFT - 7 ) ;
2012-05-02 06:28:41 +04:00
max_wshare = min ( 4UL * 1024 * 1024 , limit ) ;
max_rshare = min ( 6UL * 1024 * 1024 , limit ) ;
2006-03-25 12:34:07 +03:00
2022-06-09 09:34:07 +03:00
init_net . ipv4 . sysctl_tcp_wmem [ 0 ] = PAGE_SIZE ;
2017-11-07 11:29:28 +03:00
init_net . ipv4 . sysctl_tcp_wmem [ 1 ] = 16 * 1024 ;
init_net . ipv4 . sysctl_tcp_wmem [ 2 ] = max ( 64 * 1024 , max_wshare ) ;
2006-03-25 12:34:07 +03:00
2022-06-09 09:34:07 +03:00
init_net . ipv4 . sysctl_tcp_rmem [ 0 ] = PAGE_SIZE ;
2018-09-27 21:21:19 +03:00
init_net . ipv4 . sysctl_tcp_rmem [ 1 ] = 131072 ;
init_net . ipv4 . sysctl_tcp_rmem [ 2 ] = max ( 131072 , max_rshare ) ;
2005-04-17 02:20:36 +04:00
2012-03-12 11:03:32 +04:00
pr_info ( "Hash tables configured (established %u bind %u)\n" ,
2012-03-11 22:36:11 +04:00
tcp_hashinfo . ehash_mask + 1 , tcp_hashinfo . bhash_size ) ;
2005-06-23 23:19:55 +04:00
2016-12-28 12:52:32 +03:00
tcp_v4_init ( ) ;
2012-07-10 11:49:14 +04:00
tcp_metrics_init ( ) ;
2014-09-27 00:37:32 +04:00
BUG_ON ( tcp_register_congestion_control ( & tcp_reno ) != 0 ) ;
tcp: TCP Small Queues
This introduces TSQ (TCP Small Queues).
The goal of TSQ is to reduce the number of TCP packets in xmit queues
(qdisc & device queues), to reduce RTT and cwnd bias, part of the
bufferbloat problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
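The two rules described above can be sketched as follows. The function names are hypothetical and the exact comparison used for throttling is an assumption of this sketch; only the "half the limit" cap and the 40000 example come from the text.

```c
#include <assert.h>

/* TSO packets are capped to half the output limit. */
static unsigned int tso_cap_bytes(unsigned int limit_output_bytes)
{
	assert(limit_output_bytes != 0);
	return limit_output_bytes / 2;
}

/*
 * Whether another packet may be handed to the qdisc/dev layers, given
 * the bytes already accounted in sk_wmem_alloc. Assumes the socket is
 * throttled once the accounted bytes reach the limit.
 */
static int tsq_may_queue(unsigned int wmem_alloc, unsigned int limit)
{
	return wmem_alloc < limit;
}
```

With a limit of 40000 the cap is 20000 bytes, and with a limit of 131072 it is 65536 bytes, so two capped TSO packets fit under the limit.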
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and socket autotuning on both sides no longer uses 4 MBytes.
As the skb destructor cannot restart xmit itself (as the qdisc lock might
be taken at this point), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If the tasklet finds a socket owned by the user, it sets the TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 09:50:31 +04:00
tcp_tasklet_init ( ) ;
2020-01-22 03:56:15 +03:00
mptcp_init ( ) ;
2005-04-17 02:20:36 +04:00
}