linux/net/core
Kuniyuki Iwashima 333bb73f62 tcp: Keep TCP_CLOSE sockets in the reuseport group.
When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.

The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and
access it while any child socket references them. The point is that
reuseport_detach_sock() was called twice from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[] and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set NULL to
sk_reuseport_cb.

By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
such sockets not to have the reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks more than 0.

When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (reuseport_add_sock() path), we move
the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, not to break the current shutdown behaviour:

  - we cannot carry over the eBPF prog of shutdowned sockets
  - we cannot attach/detach an eBPF prog to/from listening sockets via
    shutdowned sockets

Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
2021-06-15 18:01:05 +02:00
..
bpf_sk_storage.c bpf: Use struct_size() in kzalloc() 2021-05-13 15:58:00 -07:00
datagram.c udp: fix skb_copy_and_csum_datagram with odd segment sizes 2021-02-04 18:56:56 -08:00
datagram.h
dev_addr_lists.c net: core: Correct function name dev_uc_flush() in the kerneldoc 2021-03-28 17:56:56 -07:00
dev_ioctl.c net: fix dev_ifsioc_locked() race condition 2021-02-11 18:14:19 -08:00
dev.c net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT 2021-05-13 13:11:19 -07:00
devlink.c devlink: Extend SF port attributes to have external attribute 2021-04-24 00:58:53 -07:00
drop_monitor.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-03-25 15:31:22 -07:00
dst_cache.c
dst.c net, bpf: Fix ip6ip6 crash with collect_md populated skbs 2021-03-10 12:24:18 -08:00
failover.c
fib_notifier.c
fib_rules.c treewide: rename nla_strlcpy to nla_strscpy. 2020-11-16 08:08:54 -08:00
filter.c xdp: Extend xdp_redirect_map with broadcast support 2021-05-26 09:46:16 +02:00
flow_dissector.c flow_dissector: Fix out-of-bounds warning in __skb_flow_bpf_to_target() 2021-04-16 17:02:27 -07:00
flow_offload.c net: flow_offload: Fix memory leak for indirect flow block 2020-12-09 16:08:33 -08:00
gen_estimator.c net_sched: gen_estimator: support large ewma log 2021-01-15 18:11:06 -08:00
gen_stats.c docs: networking: convert gen_stats.txt to ReST 2020-04-28 14:39:46 -07:00
gro_cells.c gro_cells: reduce number of synchronize_net() calls 2020-11-25 11:28:12 -08:00
hwbm.c
link_watch.c net: Add IF_OPER_TESTING 2020-04-20 12:43:24 -07:00
lwt_bpf.c lwt_bpf: Replace preempt_disable() with migrate_disable() 2020-12-07 11:53:40 -08:00
lwtunnel.c net: ipv6: add rpl sr tunnel 2020-03-29 22:30:57 -07:00
Makefile net: selftest: fix build issue if INET is disabled 2021-04-28 14:06:45 -07:00
neighbour.c neighbour: Remove redundant initialization of 'bucket' 2021-05-10 14:25:13 -07:00
net_namespace.c net: initialize net->net_cookie at netns setup 2021-02-11 14:10:07 -08:00
net-procfs.c net: move the ptype_all and ptype_base declarations to include/linux/netdevice.h 2021-03-22 13:14:45 -07:00
net-sysfs.c net-sysfs: remove possible sleep from an RCU read-side critical section 2021-03-22 13:28:13 -07:00
net-sysfs.h net-sysfs: add netdev_change_owner() 2020-02-26 20:07:25 -08:00
net-traces.c tcp: add tracepoint for checksum errors 2021-05-14 15:26:03 -07:00
netclassid_cgroup.c net: Remove the err argument from sock_from_file 2020-12-04 22:32:40 +01:00
netevent.c net: core: Correct function name netevent_unregister_notifier() in the kerneldoc 2021-03-28 17:56:56 -07:00
netpoll.c Revert "net: Have netpoll bring-up DSA management interface" 2021-02-06 14:42:57 -08:00
netprio_cgroup.c net: Remove the err argument from sock_from_file 2020-12-04 22:32:40 +01:00
page_pool.c net: page_pool: use alloc_pages_bulk in refill code path 2021-04-30 11:20:43 -07:00
pktgen.c pktgen: fix misuse of BUG_ON() in pktgen_thread_worker() 2021-01-27 16:46:37 -08:00
ptp_classifier.c ptp: Add generic ptp v2 header parsing function 2020-08-19 16:07:49 -07:00
request_sock.c
rtnetlink.c rtnetlink: avoid RCU read lock when holding RTNL 2021-05-10 14:33:10 -07:00
scm.c scm: fix a typo in put_cmsg() 2021-04-16 11:41:07 -07:00
secure_seq.c crypto: lib/sha1 - remove unnecessary includes of linux/cryptohash.h 2020-05-08 15:32:17 +10:00
selftests.c net: add generic selftest support 2021-04-20 16:08:02 -07:00
skbuff.c Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
skmsg.c skmsg: Remove unused parameters of sk_msg_wait_data() 2021-05-18 16:44:19 +02:00
sock_diag.c bpf, net: Rework cookie generator as per-cpu one 2020-09-30 11:50:35 -07:00
sock_map.c sock_map: Fix a potential use-after-free in sock_map_close() 2021-04-12 17:35:26 +02:00
sock_reuseport.c tcp: Keep TCP_CLOSE sockets in the reuseport group. 2021-06-15 18:01:05 +02:00
sock.c net: sock: remove the unnecessary check in proto_register 2021-04-23 13:10:03 -07:00
stream.c
sysctl_net_core.c net: change netdev_unregister_timeout_secs min value to 1 2021-03-25 17:24:06 -07:00
timestamping.c
tso.c net: tso: add UDP segmentation support 2020-06-18 20:46:23 -07:00
utils.c net: Fix skb->csum update in inet_proto_csum_replace16(). 2020-01-24 20:54:30 +01:00
xdp.c xdp: Extend xdp_redirect_map with broadcast support 2021-05-26 09:46:16 +02:00