linux/net/ipv4
Eric Dumazet ca777eff51 tcp: remove dst refcount false sharing for prequeue mode
Alexander Duyck reported high false sharing on dst refcount in tcp stack
when prequeue is used. prequeue is the mechanism used when a thread is
blocked in recvmsg()/read() on a TCP socket, using a blocking model
rather than select()/poll()/epoll() non blocking one.

We already try to use RCU in input path as much as possible, but we were
forced to take a refcount on the dst when skb escaped RCU protected
region. When/if the user thread runs on different cpu, dst_release()
will then touch dst refcount again.

Commit 093162553c (tcp: force a dst refcount when prequeue packet)
was an example of a race fix.

It turns out the only remaining usage of skb->dst for a packet stored
in a TCP socket prequeue is IP early demux.

We can add a logic to detect when IP early demux is probably going
to use skb->dst. Because we do an optimistic check rather than duplicate
existing logic, we need to guard inet_sk_rx_dst_set() and
inet6_sk_rx_dst_set() from using a NULL dst.

Many thanks to Alexander for providing a nice bug report, git bisection,
and reproducer.

Tested using Alexander script on a 40Gb NIC, 8 RX queues.
Hosts have 24 cores, 48 hyper threads.

echo 0 >/proc/sys/net/ipv4/tcp_autocorking

for i in `seq 0 47`
do
  for j in `seq 0 2`
  do
     netperf -H $DEST -t TCP_STREAM -l 1000 \
             -c -C -T $i,$i -P 0 -- \
             -m 64 -s 64K -D &
  done
done

Before patch : ~6Mpps and ~95% cpu usage on receiver
After patch : ~9Mpps and ~35% cpu usage on receiver.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 16:54:41 -07:00
..
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-09-07 21:41:53 -07:00
af_inet.c net/ipv4: bind ip_nonlocal_bind to current netns 2014-09-09 11:27:09 -07:00
ah4.c ah4: Use the IPsec protocol multiplexer API 2014-02-25 07:04:17 +01:00
arp.c ipv4: arp: update neighbour address when a gratuitous arp is received and arp_accept is set 2014-01-02 00:08:38 -05:00
cipso_ipv4.c netlabel: shorter names for the NetLabel catmap funcs/structs 2014-08-01 11:17:37 -04:00
datagram.c net: Save TX flow hash in sock and set in skbuf on xmit 2014-07-07 21:14:21 -07:00
devinet.c ipv4: fail early when creating netdev named all or default 2014-07-29 11:43:50 -07:00
esp4.c esp4: Use the IPsec protocol multiplexer API 2014-02-25 07:04:17 +01:00
fib_frontend.c ipv4: Restore accept_local behaviour in fib_validate_source() 2014-08-22 12:23:10 -07:00
fib_lookup.h
fib_rules.c
fib_semantics.c ipv4: fix a race in update_or_create_fnhe() 2014-09-05 17:15:50 -07:00
fib_trie.c list: fix order of arguments for hlist_add_after(_rcu) 2014-08-06 18:01:24 -07:00
gre_demux.c net: Fix GRE RX to use skb_transport_header for GRE header offset 2014-09-08 15:23:05 -07:00
gre_offload.c gre: Add support for checksum unnecessary conversions 2014-09-01 21:36:28 -07:00
icmp.c ipv4: remove nested rcu_read_lock/unlock 2014-08-02 15:27:35 -07:00
igmp.c ipv4: implement igmp_qrv sysctl to tune igmp robustness variable 2014-09-04 22:26:14 -07:00
inet_connection_sock.c ipv4: make ip_local_reserved_ports per netns 2014-05-14 15:31:45 -04:00
inet_diag.c inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets 2014-01-13 22:35:46 -08:00
inet_fragment.c inet: frags: use kmem_cache for inet_frag_queue 2014-08-02 15:31:31 -07:00
inet_hashtables.c net: use reciprocal_scale() helper 2014-08-23 12:21:21 -07:00
inet_lro.c lro: remove dead code 2013-12-29 16:34:25 -05:00
inet_timewait_sock.c
inetpeer.c inet: remove dead inetpeer sequence code 2014-09-08 16:42:42 -07:00
ip_forward.c net: rename local_df to ignore_df 2014-05-12 14:03:41 -04:00
ip_fragment.c inet: frags: use kmem_cache for inet_frag_queue 2014-08-02 15:31:31 -07:00
ip_gre.c gre: allow changing mac address when device is up 2014-06-10 22:46:42 -07:00
ip_input.c net: Fix memory leak if TPROXY used with TCP early demux 2014-01-27 16:22:11 -08:00
ip_options.c ipv4: fix buffer overflow in ip_options_compile() 2014-07-21 20:16:26 -07:00
ip_output.c net-timestamp: add key to disambiguate concurrent datagrams 2014-08-05 16:35:54 -07:00
ip_sockglue.c sock: deduplicate errqueue dequeue 2014-09-01 21:49:08 -07:00
ip_tunnel_core.c net: Support for multiple checksums with gso 2014-06-04 22:46:38 -07:00
ip_tunnel.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-08-05 18:46:26 -07:00
ip_vti.c vti: Simplify error handling in module init and exit 2014-06-26 08:21:57 +02:00
ipcomp.c ipcomp4: Use the IPsec protocol multiplexer API 2014-02-25 07:04:17 +01:00
ipconfig.c ipconfig: Use time_before 2014-08-22 12:23:11 -07:00
ipip.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-06-11 16:02:55 -07:00
ipmr.c net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
Kconfig udp: Add udp_sock_create for UDP tunnels to open listener socket 2014-07-14 16:12:15 -07:00
Makefile udp: Add udp_sock_create for UDP tunnels to open listener socket 2014-07-14 16:12:15 -07:00
netfilter.c netfilter: remove double colon 2014-02-19 11:41:25 +01:00
ping.c net/ipv4: bind ip_nonlocal_bind to current netns 2014-09-09 11:27:09 -07:00
proc.c inet: frag: don't account number of fragment queues 2014-07-27 22:34:36 -07:00
protocol.c
raw.c ipv4: Make IP_MULTICAST_ALL and IP_MSFILTER work on raw sockets 2014-07-23 15:13:26 -07:00
route.c ipv4: harden fnhe_hashfun() 2014-09-05 17:40:33 -07:00
syncookies.c tcp: syncookies: mark cookie_secret read_mostly 2014-08-27 16:30:49 -07:00
sysctl_net_ipv4.c net/ipv4: bind ip_nonlocal_bind to current netns 2014-09-09 11:27:09 -07:00
tcp_bic.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_cong.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_cubic.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_diag.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_fastopen.c tcp: remove unnecessary tcp_sk assignment. 2014-06-16 21:35:00 -07:00
tcp_highspeed.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_htcp.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_hybla.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_illinois.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_input.c tcp: remove TCP_SKB_CB(skb)->when 2014-09-05 17:49:33 -07:00
tcp_ipv4.c tcp: remove dst refcount false sharing for prequeue mode 2014-09-09 16:54:41 -07:00
tcp_lp.c tcp: remove in_flight parameter from cong_avoid() methods 2014-05-03 19:23:07 -04:00
tcp_memcontrol.c cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() 2014-07-15 11:05:09 -04:00
tcp_metrics.c tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic 2014-08-14 14:38:54 -07:00
tcp_minisocks.c tcp: introduce TCP_SKB_CB(skb)->tcp_tw_isn 2014-09-05 17:49:33 -07:00
tcp_offload.c tcp: Call skb_gro_checksum_validate 2014-08-24 18:09:24 -07:00
tcp_output.c tcp: remove obsolete comment about TCP_SKB_CB(skb)->when in tcp_fragment() 2014-09-06 12:29:10 -07:00
tcp_probe.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_scalable.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_timer.c tcp: remove TCP_SKB_CB(skb)->when 2014-09-05 17:49:33 -07:00
tcp_vegas.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_vegas.h
tcp_veno.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_westwood.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_yeah.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp.c tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) 2014-08-14 14:38:54 -07:00
tunnel4.c
udp_diag.c
udp_impl.h
udp_offload.c udp: Add support for doing checksum unnecessary conversion 2014-09-01 21:36:28 -07:00
udp_tunnel.c udp: Add udp_sock_create for UDP tunnels to open listener socket 2014-07-14 16:12:15 -07:00
udp.c net: merge cases where sock_efree and sock_edemux are the same function 2014-09-05 17:43:45 -07:00
udplite.c net: Eliminate no_check from protosw 2014-05-23 16:28:53 -04:00
xfrm4_input.c xfrm4: Add IPsec protocol multiplexer 2014-02-25 07:04:16 +01:00
xfrm4_mode_beet.c
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c inetpeer: get rid of ip_id_count 2014-06-02 11:00:41 -07:00
xfrm4_output.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-05-24 00:32:30 -04:00
xfrm4_policy.c xfrm: Introduce xfrm_input_afinfo to access the the callbacks properly 2014-03-14 07:28:07 +01:00
xfrm4_protocol.c xfrm4: Remove duplicate semicolon 2014-06-30 07:49:47 +02:00
xfrm4_state.c
xfrm4_tunnel.c