linux/net
Joanne Koong 28044fc1d4 net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.

This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.

Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:

  1) The case where there is a bind() call on INADDR_ANY (ipv4) or
  IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
  assign the socket an address when it handles the connect()

  2) In inet_sk_reselect_saddr(), which is called when rebuilding the
  sk header and a few pre-conditions are met (eg rerouting fails).

In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.

* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.

This brings up a few stipulations:
  1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
  will always be acquired after the bhash lock and released before the
  bhash lock is released.

  2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
  acquired+released before another bhash2 lock is acquired+released.

* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-24 19:30:07 -07:00
..
6lowpan
9p iov_iter stuff, part 2, rebased 2022-08-08 20:04:35 -07:00
802
8021q vlan: move from strlcpy with unused retval to strscpy 2022-08-22 17:55:48 -07:00
appletalk
atm
ax25 ax25: move from strlcpy with unused retval to strscpy 2022-08-22 17:55:50 -07:00
batman-adv batman-adv: tracing: Use the new __vstring() helper 2022-07-30 13:52:47 -04:00
bluetooth Bluetooth: ISO: Fix not using the correct QoS 2022-08-08 17:06:36 -07:00
bpf Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-08-17 20:29:36 -07:00
bpfilter
bridge net: bridge: move DSA master bridging restriction to DSA 2022-08-23 11:39:22 +02:00
caif caif: move from strlcpy with unused retval to strscpy 2022-08-22 17:57:35 -07:00
can can: j1939: j1939_session_destroy(): fix memory leak of skbs 2022-08-09 09:05:06 +02:00
ceph libceph: clean up ceph_osdc_start_request prototype 2022-08-03 14:05:39 +02:00
core net: skb: prevent the split of kfree_skb_reason() by gcc 2022-08-24 09:47:51 +01:00
dcb
dccp net: Add a bhash2 table hashed by port and address 2022-08-24 19:30:07 -07:00
dns_resolver
dsa net: dsa: use dsa_tree_for_each_cpu_port in dsa_tree_{setup,teardown}_master 2022-08-23 11:39:22 +02:00
ethernet
ethtool ethtool: move from strlcpy with unused retval to strscpy 2022-08-22 18:06:55 -07:00
hsr treewide: Replace GPLv2 boilerplate/reference with SPDX - gpl-2.0_30.RULE (part 2) 2022-06-10 14:51:35 +02:00
ieee802154
ife
ipv4 net: Add a bhash2 table hashed by port and address 2022-08-24 19:30:07 -07:00
ipv6 net: Add a bhash2 table hashed by port and address 2022-08-24 19:30:07 -07:00
iucv net: keep sk->sk_forward_alloc as small as possible 2022-06-10 16:21:27 -07:00
kcm
key xfrm: change the type of xfrm_register_km and xfrm_unregister_km 2022-06-24 10:19:11 +02:00
l2tp l2tp: move from strlcpy with unused retval to strscpy 2022-08-22 17:59:46 -07:00
l3mdev
lapb
llc
mac80211 Tracing updates for 5.20 / 6.0 2022-08-05 09:41:12 -07:00
mac802154
mctp
mpls
mptcp mptcp: do not queue data on closed subflows 2022-08-05 08:51:28 +01:00
ncsi net/ncsi: use proper "mellanox" DT vendor prefix 2022-06-23 20:51:06 -07:00
netfilter Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
netlabel netlabel: fix typo in comment 2022-08-10 09:24:41 +01:00
netlink net: genl: fix error path memory leak in policy dumping 2022-08-18 10:20:48 -07:00
netrom
nfc
nsh
openvswitch openvswitch: move from strlcpy with unused retval to strscpy 2022-08-22 18:07:12 -07:00
packet packet: move from strlcpy with unused retval to strscpy 2022-08-22 17:59:51 -07:00
phonet
psample
qrtr net: qrtr: start MHI channel after endpoit creation 2022-08-15 11:21:42 +01:00
rds rds: add missing barrier to release_refill 2022-08-12 10:46:01 +01:00
rfkill
rose net: rose: add netdev ref tracker to 'struct rose_sock' 2022-08-01 11:59:23 -07:00
rxrpc net: delete extra space and tab in blank line 2022-07-25 19:38:31 -07:00
sched net: sched: remove duplicate check of user rights in qdisc 2022-08-23 10:18:26 +02:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-28 18:21:16 -07:00
smc net/smc: Enable module load on netlink usage 2022-07-27 13:24:42 +01:00
strparser strparser: pad sk_skb_cb to avoid straddling cachelines 2022-07-08 18:38:44 -07:00
sunrpc net/sunrpc: fix potential memory leaks in rpc_sysfs_xprt_state_change() 2022-08-12 11:21:28 +01:00
switchdev
tipc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-28 18:21:16 -07:00
tls tls: rx: react to strparser initialization errors 2022-08-17 10:24:00 +01:00
unix af_unix: Show number of inflight fds for sockets in TCP_LISTEN state too 2022-08-22 11:34:54 +01:00
vmw_vsock vmci/vsock: check SO_RCVLOWAT before wake up reader 2022-08-23 10:43:12 +02:00
wireless wifi: cfg80211: Fix validating BSS pointers in __cfg80211_connect_result 2022-08-08 11:09:52 +03:00
x25 net/x25: fix call timeouts in blocking connects 2022-08-08 20:48:51 -07:00
xdp xsk: Mark napi_id on sendmsg() 2022-07-14 22:45:34 +02:00
xfrm Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2022-07-25 13:25:39 +01:00
compat.c Merge branch 'for-5.20/io_uring' into for-5.20/io_uring-zerocopy-send 2022-07-24 18:41:03 -06:00
devres.c
Kconfig Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
Kconfig.debug
Makefile Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
socket.c Networking changes for 6.0. 2022-08-03 16:29:08 -07:00
sysctl_net.c