linux/net
Kuniyuki Iwashima b261eda84e soreuseport: Fix socket selection for SO_INCOMING_CPU.
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.

With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.

setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket.  Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.

The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU with the flow.  However, the score gets
ignored after the blamed two commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.

This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one.  We increment the counter when

  * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

  * enabling SO_INCOMING_CPU if the socket is in a reuseport group

Also, we decrement it when

  * detaching a socket out of the group to apply SO_INCOMING_CPU to
    migrated TCP requests

  * disabling SO_INCOMING_CPU if the socket is in a reuseport group

When the counter reaches 0, we can get back to the O(1) selection
algorithm.

The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incomnig_cpu
under reuseport_lock.  Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.

 cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+

lock_sock(sk1)                  lock_sock(sk2)

reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1->incoming_cpu, val)
|
|                               spin_lock_bh(&reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse->max_socks * 2U;
|                               |- if (more_socks_size > U16_MAX &&
|                               |       reuse->num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1->sk_incoming_cpu >= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_lock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse->incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b88 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:35:16 +02:00
..
6lowpan
9p net/9p: clarify trans_fd parse_opt failure handling 2022-10-07 21:23:09 +09:00
802 treewide: use prandom_u32_max() when possible, part 1 2022-10-11 17:42:55 -06:00
8021q net: gro: skb_gro_header helper function 2022-08-25 10:33:21 +02:00
appletalk
atm net/atm: fix proc_mpc_write incorrect return value 2022-10-15 11:08:36 +01:00
ax25 ax25: move from strlcpy with unused retval to strscpy 2022-08-22 17:55:50 -07:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-09-22 13:02:10 -07:00
bluetooth Driver core changes for 6.1-rc1 2022-10-07 17:04:10 -07:00
bpf selftests/bpf: Add tests for kfunc returning a memory pointer 2022-09-07 11:05:17 -07:00
bpfilter
bridge bridge: mcast: Simplify MDB entry creation 2022-10-19 14:01:08 +01:00
caif caif: move from strlcpy with unused retval to strscpy 2022-08-22 17:57:35 -07:00
can can: bcm: check the result of can_send() in bcm_can_tx() 2022-09-23 13:53:10 +02:00
ceph Random number generator fixes for Linux 6.1-rc1. 2022-10-16 15:27:07 -07:00
core soreuseport: Fix socket selection for SO_INCOMING_CPU. 2022-10-25 11:35:16 +02:00
dcb
dccp dccp: Call inet6_destroy_sock() via sk->sk_destruct(). 2022-10-24 09:40:38 +01:00
dns_resolver
dsa net: dsa: uninitialized variable in dsa_slave_netdevice_event() 2022-10-15 11:15:27 +01:00
ethernet net: gro: skb_gro_header helper function 2022-08-25 10:33:21 +02:00
ethtool Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-10-24 13:44:11 -07:00
hsr net: hsr: avoid possible NULL deref in skb_clone() 2022-10-18 19:18:27 -07:00
ieee802154 net/ieee802154: don't warn zero-sized raw_sendmsg() 2022-10-05 12:37:10 +02:00
ife
ipv4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-10-24 13:44:11 -07:00
ipv6 net: remove useless parameter of __sock_cmsg_send 2022-10-24 12:43:46 +01:00
iucv
kcm kcm: annotate data-races around kcm->rx_wait 2022-10-24 10:57:55 +01:00
key Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2022-08-24 12:51:50 +01:00
l2tp inet6: Remove inet6_destroy_sock() in sk->sk_prot->destroy(). 2022-10-24 09:40:38 +01:00
l3mdev
lapb
llc
mac80211 Random number generator fixes for Linux 6.1-rc1. 2022-10-16 15:27:07 -07:00
mac802154 net: mac802154: Fix a condition in the receive path 2022-08-29 11:10:22 +02:00
mctp mctp: prevent double key removal and unref 2022-10-12 13:30:50 +01:00
mpls net: Use u64_stats_fetch_begin_irq() for stats fetch. 2022-08-29 13:02:27 +01:00
mptcp net: introduce and use custom sockopt socket flag 2022-10-24 10:52:50 +01:00
ncsi genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
netfilter Networking fixes for 6.1-rc2, including fixes from netfilter 2022-10-20 17:24:59 -07:00
netlabel genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
netlink net: add a refcount tracker for kernel sockets 2022-10-24 11:04:43 +01:00
netrom
nfc NFC: hci: Split memcpy() of struct hcp_message flexible array 2022-09-27 07:45:18 -07:00
nsh
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-10-20 17:49:10 -07:00
packet treewide: use prandom_u32_max() when possible, part 1 2022-10-11 17:42:55 -06:00
phonet
psample genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
qrtr net: qrtr: start MHI channel after endpoit creation 2022-08-15 11:21:42 +01:00
rds net: add a refcount tracker for kernel sockets 2022-10-24 11:04:43 +01:00
rfkill
rose rose: check NULL rose_loopback_neigh->loopback 2022-08-22 14:24:54 +01:00
rxrpc rxrpc: remove rxrpc_max_call_lifetime declaration 2022-09-19 17:58:47 -07:00
sched act_skbedit: skbedit queue mapping for receive queue 2022-10-25 10:32:40 +02:00
sctp sctp: Call inet6_destroy_sock() via sk->sk_destruct(). 2022-10-24 09:40:39 +01:00
smc net/smc: Fix an error code in smc_lgr_create() 2022-10-15 11:12:12 +01:00
strparser
sunrpc Random number generator fixes for Linux 6.1-rc1. 2022-10-16 15:27:07 -07:00
switchdev
tipc tipc: fix a null-ptr-deref in tipc_topsrv_accept 2022-10-20 21:08:17 -07:00
tls tls: strp: make sure the TCP skbs do not have overlapping data 2022-10-14 08:25:26 +01:00
unix Random number generator fixes for Linux 6.1-rc1. 2022-10-16 15:27:07 -07:00
vmw_vsock Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-10-03 17:44:18 -07:00
wireless Merge branch 'cve-fixes-2022-10-13' 2022-10-13 11:59:56 +02:00
x25 net/x25: fix call timeouts in blocking connects 2022-08-08 20:48:51 -07:00
xdp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-10-03 17:44:18 -07:00
xfrm treewide: use prandom_u32_max() when possible, part 1 2022-10-11 17:42:55 -06:00
compat.c net: clear msg_get_inq in __get_compat_msghdr() 2022-09-20 08:23:20 -07:00
devres.c
Kconfig Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
Kconfig.debug net: make NET_(DEV|NS)_REFCNT_TRACKER depend on NET 2022-09-20 14:23:56 -07:00
Makefile Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
socket.c net: introduce and use custom sockopt socket flag 2022-10-24 10:52:50 +01:00
sysctl_net.c