linux/drivers/net
Jason A. Donenfeld 8b5553ace8 wireguard: queueing: get rid of per-peer ring buffers
Having two ring buffers per-peer means that every peer results in two
massive ring allocations. On an 8-core x86_64 machine, this commit
reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
is a 90% reduction. Ninety percent! With some single-machine
deployments approaching 500,000 peers, we're talking about a reduction
from 7 gigs of memory down to 700 megs of memory.

In order to get rid of these per-peer allocations, this commit switches
to using a list-based queueing approach. Currently GSO fragments are
chained together using the skb->next pointer (the skb_list_* singly
linked list approach), so we form the per-peer queue around the unused
skb->prev pointer (which sort of makes sense because the links are
pointing backwards). Use of skb_queue_* is not possible here, because
that is based on doubly linked lists and spinlocks. Multiple cores can
write into the queue at any given time, because its writes occur in the
start_xmit path or in the udp_recv path. But reads happen in a single
workqueue item per-peer, amounting to a multi-producer, single-consumer
paradigm.

The MPSC queue is implemented locklessly and never blocks. However, it
is not linearizable (though it is serializable), with a very tight and
unlikely race on writes, which, when hit (some tiny fraction of the
0.15% of partial adds on a fully loaded 16-core x86_64 system), causes
the queue reader to terminate early. However, because every packet sent
queues up the same workqueue item after it is fully added, the worker
resumes again, and stopping early isn't actually a problem, since at
that point the packet wouldn't have yet been added to the encryption
queue. These properties allow us to avoid disabling interrupts or
spinning. The design is based on Dmitry Vyukov's algorithm [1].
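
As a rough illustration of the algorithm in [1], here is a minimal userspace
sketch using C11 atomics. It is not the code from this commit: the names
mpsc_node, mpsc_queue, mpsc_init, mpsc_push and mpsc_pop are hypothetical, and
the sketch uses a separate node type where the commit links packets through
skb->prev. The memory orderings are one plausible choice, not a statement of
what the kernel uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical intrusive node; the commit uses skb->prev as the link. */
struct mpsc_node {
	_Atomic(struct mpsc_node *) next;
};

struct mpsc_queue {
	_Atomic(struct mpsc_node *) head; /* producers exchange on this end */
	struct mpsc_node *tail;           /* touched only by the consumer */
	struct mpsc_node stub;            /* placeholder so the list is never empty */
};

static void mpsc_init(struct mpsc_queue *q)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->head, &q->stub);
	q->tail = &q->stub;
}

/* Any number of producers may call this concurrently; it never blocks. */
static void mpsc_push(struct mpsc_queue *q, struct mpsc_node *n)
{
	struct mpsc_node *prev;

	assert(n != NULL);
	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
	/*
	 * Between the exchange above and the store below, n is the head
	 * but is not yet reachable from the tail. This is the "partial
	 * add" window in which a reader stops early.
	 */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Single consumer only. Returns NULL when empty or when a push is mid-flight. */
static struct mpsc_node *mpsc_pop(struct mpsc_queue *q)
{
	struct mpsc_node *tail = q->tail;
	struct mpsc_node *next =
		atomic_load_explicit(&tail->next, memory_order_acquire);

	if (tail == &q->stub) {
		if (!next)
			return NULL;
		/* Skip over the stub. */
		q->tail = next;
		tail = next;
		next = atomic_load_explicit(&tail->next, memory_order_acquire);
	}
	if (next) {
		q->tail = next;
		return tail;
	}
	/* tail has no successor; if it is not the head, a push is in progress. */
	if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
		return NULL;
	/* tail is the last real node: re-add the stub so tail can be handed out. */
	mpsc_push(q, &q->stub);
	next = atomic_load_explicit(&tail->next, memory_order_acquire);
	if (next) {
		q->tail = next;
		return tail;
	}
	return NULL;
}
```

In this sketch a NULL return from mpsc_pop() does not distinguish "empty"
from "a producer has not finished linking its node", which is why the text
above relies on every producer re-queueing the worker after its add completes.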

Performance-wise, ordinarily list-based queues aren't preferable to
ring buffers, because of cache misses when following pointers around.
However, we *already* have to follow the adjacent pointers when working
through fragments, so there shouldn't actually be any change there. A
potential downside is that dequeueing is a bit more complicated, but the
ptr_ring structure used prior had a spinlock when dequeueing, so all in
all the difference appears to be a wash.

Actually, from profiling, the biggest performance hit, by far, of this
commit winds up being atomic_add_unless(count, 1, max) and
atomic_dec(count), which account for the majority of CPU time, according to
perf. In that sense, the previous ring buffer was superior in that it
could check if it was full by head==tail, which the list-based approach
cannot do.
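
To make the cost concrete, the bounded counter described above can be
approximated in userspace with a C11 compare-and-swap loop. This is only a
sketch of the "add one unless already at max" semantics; count_inc_unless_full
and count_dec are hypothetical names, and the kernel's atomic_add_unless() and
atomic_dec() are the real primitives the commit uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *count unless it already equals max.
 * Returns true if the increment happened, false if the queue is "full".
 */
static bool count_inc_unless_full(atomic_int *count, int max)
{
	int cur = atomic_load_explicit(count, memory_order_relaxed);

	assert(max > 0);
	while (cur != max) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(count, &cur, cur + 1,
							  memory_order_acq_rel,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Called once per dequeued item to release its slot. */
static void count_dec(atomic_int *count)
{
	atomic_fetch_sub_explicit(count, 1, memory_order_release);
}
```

Every enqueue and dequeue in this scheme touches the same shared counter,
so the cache line holding it is contended by all producers and the consumer,
which is consistent with it showing up prominently in a profile.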

But all in all, this enables us to get massive memory savings, allowing
WireGuard to scale for real world deployments, without taking much of a
performance hit.

[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Fixes: e7096c131e ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-23 15:59:34 -08:00
appletalk
arcnet arcnet: use new tasklet API 2021-02-02 15:51:17 -08:00
bonding bonding: 3ad: Print an error for unknown speeds 2021-02-11 14:28:21 -08:00
caif TTY/Serial driver changes for 5.12-rc1 2021-02-20 21:28:04 -08:00
can can: mcp251xfd: mcp251xfd_probe(): use dev_err_probe() to simplify error handling 2021-01-29 09:31:58 +01:00
dsa net: dsa: b53: Support setting learning on port 2021-02-23 12:23:00 -08:00
ethernet Marvell Sky2 Ethernet adapter: fix warning messages. 2021-02-23 12:17:21 -08:00
fddi
fjes
hamradio
hippi
hyperv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-10 13:30:12 -08:00
ieee802154
ipa Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-16 17:51:13 -08:00
ipvlan
mdio net: mdio: Remove of_phy_attach() 2021-02-17 13:17:49 -08:00
mhi net: mhi: Add mbim proto 2021-02-10 15:11:51 -08:00
netdevsim netdevsim: fib: Add debugfs to debug route offload failure 2021-02-08 16:47:03 -08:00
pcs net: pcs: add pcs-lynx 1000BASE-X support 2021-02-06 14:35:21 -08:00
phy net: phy: icplus: call phy_restore_page() when phy_select_page() fails 2021-02-22 18:47:48 -08:00
plip
ppp TTY/Serial driver changes for 5.12-rc1 2021-02-20 21:28:04 -08:00
slip
team
usb r8152: spilt rtl_set_eee_plus and r8153b_green_en 2021-02-23 12:36:02 -08:00
vmxnet3 vmxnet3: Remove buf_info from device accessible structures 2021-01-29 21:07:03 -08:00
wan Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-16 17:51:13 -08:00
wireguard wireguard: queueing: get rid of per-peer ring buffers 2021-02-23 15:59:34 -08:00
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-16 17:51:13 -08:00
xen-netback xen/events: link interdomain events to associated xenbus device 2021-02-11 14:47:00 -08:00
bareudp.c udp: call udp_encap_enable for v6 sockets when enabling encap 2021-02-04 18:37:14 -08:00
dummy.c
eql.c
geneve.c
gtp.c net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending 2021-02-23 11:29:52 -08:00
ifb.c ifb: use new tasklet API 2021-02-02 15:51:18 -08:00
Kconfig
LICENSE.SRC
loopback.c Revert "net-loopback: set lo dev initial state to UP" 2021-02-11 13:10:44 -08:00
macsec.c
macvlan.c
macvtap.c
Makefile net: mhi: Add dedicated folder 2021-02-10 15:11:51 -08:00
mdio.c
mii.c
net_failover.c
netconsole.c
nlmon.c
ntb_netdev.c
rionet.c
sb1000.c
Space.c
sungem_phy.c
tap.c net: fix dev_ifsioc_locked() race condition 2021-02-11 18:14:19 -08:00
thunderbolt.c
tun.c net: fix dev_ifsioc_locked() race condition 2021-02-11 18:14:19 -08:00
veth.c net, veth: Alloc skb in bulk for ndo_xdp_xmit 2021-02-04 01:00:07 +01:00
virtio_net.c
vrf.c
vsockmon.c
vxlan.c vxlan: move debug check after netdev unregister 2021-02-23 13:03:02 -08:00
xen-netfront.c drivers: net: xen-netfront: Simplify the calculation of variables 2021-02-04 10:55:24 -08:00