Commit Graph

76572 Commits

Author SHA1 Message Date
Eric Dumazet
08842c43d0 udp: no longer touch sk->sk_refcnt in early demux
After commits ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
and 7ae215d23c ("bpf: Don't refcount LISTEN sockets in sk_assign()")
UDP early demux no longer need to grab a refcount on the UDP socket.

This save two atomic operations per incoming packet for connected
sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Joe Stringer <joe@wand.net.nz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11 09:56:03 +00:00
Gavrilov Ilia
d6eb8de201 net/x25: fix incorrect parameter validation in the x25_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.

To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11 09:53:22 +00:00
Gavrilov Ilia
3ed5f41513 net: kcm: fix incorrect parameter validation in the kcm_getsockopt) function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.

To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.

Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11 09:53:22 +00:00
Gavrilov Ilia
4bb3ba7b74 udp: fix incorrect parameter validation in the udp_lib_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.

To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11 09:53:22 +00:00
Gavrilov Ilia
955e9876ba l2tp: fix incorrect parameter validation in the pppol2tp_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.

To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.

Fixes: 3557baabf2 ("[L2TP]: PPP over L2TP driver core")
Reviewed-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11 09:53:22 +00:00
Gavrilov Ilia
5c3be3e0eb ipmr: fix incorrect parameter validation in the ip_mroute_getsockopt() function
The 'olr' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.

To fix the logic, check 'olr' as read from 'optlen',
where the types of relevant variables are (signed) int.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11 09:53:22 +00:00
Gavrilov Ilia
716edc9706 tcp: fix incorrect parameter validation in the do_tcp_getsockopt() function
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.

To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11 09:53:21 +00:00
Jakub Kicinski
2f901582f0 bluetooth-next pull request for net-next:
- hci_conn: Only do ACL connections sequentially
  - hci_core: Cancel request on command timeout
  - Remove CONFIG_BT_HS
  - btrtl: Add the support for RTL8852BT/RTL8852BE-VT
  - btusb: Add support Mediatek MT7920
  - btusb: Add new VID/PID 13d3/3602 for MT7925
  - Add new quirk for broken read key length on ATS2851
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmXrU9AZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKV4KD/9Ik0EwI4utskMShX9qxnqi
 8i5LBocslSFWN3gqrNUAJTxwlSgYntRK4L4v+566/Y/DISUV7OLx9hRJ8QpzWhWl
 mKweR3kB2HG8/Su6E2VbzjQTriSBuwPiMIeGwP9H5d+bN+6sNLmcl+II9QjapMYQ
 f13ZA/zQzwDlk8A5jTw1N/cOknblvlNNYUwIPlGzJK9COtQqAVSBRz00ugXmR1LG
 +UZqzOgXSNFQ2m4PCsozy4fCAVk/NaXBsdnKrsQurND30MJw1jKd9lRaIkQ+eLNv
 phYfsYLeDDtjMei4j0t23CKOSceMdvFWLtDn3wpBmZbXs8Avd13FRYxf/U88D09g
 FTNhOLbVyZbWSAEqIMuWZv/EuzZvIpOZRSlCn2hJgJTRuqIi6I9mRDF0ZD4LGUzR
 /Es/Ozfxw9CfHFRFsiM46cGgQ01Ddq4SihZnlTQfdkBPjQcAhiJ3GbIUAZs+HHVB
 QFoFLAJWepInGfmyyFHngEzdh9r5zFsA/+PL6duQ1+HqJFZbhPWtgYXijrjuimZo
 IdcmM4KUUaRWwdDivDq9X5s9luQ1BobNxvVIPlpz61QDu2uMlrilXAgNoZJtalTU
 ltQxxE9oPUv5tb8xybBYklKM9keyjTGzL3Y/LluDPgzUoY+w5gTcvEqD8ByhzEw6
 ouE5TO7r0k1h9BhHZSYHzw==
 =Jqow
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - hci_conn: Only do ACL connections sequentially
 - hci_core: Cancel request on command timeout
 - Remove CONFIG_BT_HS
 - btrtl: Add the support for RTL8852BT/RTL8852BE-VT
 - btusb: Add support Mediatek MT7920
 - btusb: Add new VID/PID 13d3/3602 for MT7925
 - Add new quirk for broken read key length on ATS2851

* tag 'for-net-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (52 commits)
  Bluetooth: hci_sync: Fix UAF in hci_acl_create_conn_sync
  Bluetooth: Fix eir name length
  Bluetooth: ISO: Align broadcast sync_timeout with connection timeout
  Bluetooth: Add new quirk for broken read key length on ATS2851
  Bluetooth: mgmt: remove NULL check in add_ext_adv_params_complete()
  Bluetooth: mgmt: remove NULL check in mgmt_set_connectable_complete()
  Bluetooth: btusb: Add support Mediatek MT7920
  Bluetooth: btmtk: Add MODULE_FIRMWARE() for MT7922
  Bluetooth: btnxpuart: Fix btnxpuart_close
  Bluetooth: ISO: Clean up returns values in iso_connect_ind()
  Bluetooth: fix use-after-free in accessing skb after sending it
  Bluetooth: af_bluetooth: Fix deadlock
  Bluetooth: bnep: Fix out-of-bound access
  Bluetooth: btusb: Fix memory leak
  Bluetooth: msft: Fix memory leak
  Bluetooth: hci_core: Fix possible buffer overflow
  Bluetooth: btrtl: fix out of bounds memory access
  Bluetooth: hci_h5: Add ability to allocate memory for private data
  Bluetooth: hci_sync: Fix overwriting request callback
  Bluetooth: hci_sync: Use QoS to determine which PHY to scan
  ...
====================

Link: https://lore.kernel.org/r/20240308181056.120547-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-08 20:37:32 -08:00
Jakub Kicinski
2612b9f10c Merge tag 'ieee802154-for-net-next-2024-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan-next
Stefan Schmidt says:

====================
pull-request: ieee802154-next 2024-03-07

Various cross tree patches for ieee802154v drivers and a resource leak
fix for ieee802154 llsec.

Andy Shevchenko changed GPIO header usage for at86rf230 and mcr20a to
only include needed headers.

Bo Liu converted the at86rf230, mcr20a and mrf24j40 driver regmap
support to use the maple tree register cache.

Fedor Pchelkin fixed a resource leak in the llsec key deletion path.

Ricardo B. Marliere made wpan_phy_class const.

Tejun Heo removed WQ_UNBOUND from a workqueue call in ca8210.

* tag 'ieee802154-for-net-next-2024-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan-next:
  ieee802154: cfg802154: make wpan_phy_class constant
  ieee802154: mcr20a: Remove unused of_gpio.h
  ieee802154: at86rf230: Replace of_gpio.h by proper one
  mac802154: fix llsec key resources release in mac802154_llsec_key_del
  ieee802154: ca8210: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call
  net: ieee802154: mrf24j40: convert to use maple tree register cache
  net: ieee802154: mcr20a: convert to use maple tree register cache
  net: ieee802154: at86rf230: convert to use maple tree register cache
====================

Link: https://lore.kernel.org/r/20240307195105.292085-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-08 20:35:33 -08:00
Eric Dumazet
d721812aa8 ipv4: raw: check sk->sk_rcvbuf earlier
There is no point cloning an skb and having to free the clone
if the receive queue of the raw socket is full.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240307163020.2524409-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-08 11:39:44 -08:00
Eric Dumazet
026763ece8 ipv6: raw: check sk->sk_rcvbuf earlier
There is no point cloning an skb and having to free the clone
if the receive queue of the raw socket is full.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240307162943.2523817-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-08 11:39:29 -08:00
Ido Schimmel
5d9b7cb383 nexthop: Simplify dump error handling
The only error that can happen during a nexthop dump is insufficient
space in the skb caring the netlink messages (EMSGSIZE). If this happens
and some messages were already filled in, the nexthop code returns the
skb length to signal the netlink core that more objects need to be
dumped.

After commit b5a899154a ("netlink: handle EMSGSIZE errors in the
core") there is no need to handle this error in the nexthop code as it
is now handled in the core.

Simplify the code and simply return the error to the core.

No regressions in nexthop tests:

 # ./fib_nexthops.sh
 Tests passed: 234
 Tests failed:   0

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240307154727.3555462-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-08 11:39:12 -08:00
Eric Dumazet
1cface552a net: add skb_data_unref() helper
Similar to skb_unref(), add skb_data_unref() to save an expensive
atomic operation (and cache line dirtying) when last reference
on shinfo->dataref is released.

I saw this opportunity on hosts with RAW sockets accidentally
bound to UDP protocol, forcing an skb_clone() on all received packets.

These RAW sockets had their receive queue full, so all clone
packets were immediately dropped.

When UDP recvmsg() consumes later the original skb, skb_release_data()
is hitting atomic_sub_return() quite badly, because skb->clone
has been set permanently.

Note that this patch helps TCP TX performance, because
TCP stack also use (fast) clones.

This means that at least one of the two packets (the main skb or
its clone) will no longer have to perform this atomic operation
in skb_release_data().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240307123446.2302230-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-08 11:38:45 -08:00
Jakub Kicinski
75c2946db3 wireless-next patches for v6.9
The fourth "new features" pull request for v6.9 with changes both in
 stack and in drivers. The theme in this pull request is to fix sparse
 warnings but we still have some left in wireless subsystem. Otherwise
 quite normal.
 
 Major changes:
 
 rtw89
 
 * NL80211_EXT_FEATURE_SCAN_RANDOM_SN support
 
 * NL80211_EXT_FEATURE_SET_SCAN_DWELL support
 
 rtw88
 
 * support for more rtw8811cu and rtw8821cu devices
 
 mt76
 
 * mt76x2u: add Netgear WNDA3100v3 USB
 
 * mt7915: newer ADIE version support
 
 * mt7925: radio temperature sensor support
 
 * mt7996: remove GCMP IGTK offload
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmXq4hARHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZtOawf9Gf2FAi56zA/4vKJPE/mZzRvNodj/u9WL
 mEX3KERw744IEmWY0yXEAyvzKkkNqUUtmdUbbsnXnnEtzsVZ2oRmOZdXsvEW3vOD
 IEsjWY/405MBWyuBttAa6orBSgelr99k86HzoLN86s52HmliVDhr2EUnYIf2O++9
 SVhHFKE4BMVCO6hlyEg419K9M2VhWtBDNYweoXAfn8Y1byAw6Pt6WunjRuGwJG5n
 qvcrZcFCFSa3daPpx0uIA/yiSjZlq0hwVC3r/PnoX/r1FDR8tS2ecvC2rP3MaZJ+
 1x3IcNvwC97D80wvdW+f+qKtV4OXZefsZpzJJpvREH8FbAgYLDef0Q==
 =gln7
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.9

The fourth "new features" pull request for v6.9 with changes both in
stack and in drivers. The theme in this pull request is to fix sparse
warnings but we still have some left in wireless subsystem. Otherwise
quite normal.

Major changes:

rtw89
 * NL80211_EXT_FEATURE_SCAN_RANDOM_SN support
 * NL80211_EXT_FEATURE_SET_SCAN_DWELL support

rtw88
 * support for more rtw8811cu and rtw8821cu devices

mt76
 * mt76x2u: add Netgear WNDA3100v3 USB
 * mt7915: newer ADIE version support
 * mt7925: radio temperature sensor support
 * mt7996: remove GCMP IGTK offload

* tag 'wireless-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (125 commits)
  wifi: rtw89: wow: move release offload packet earlier for WoWLAN mode
  wifi: rtw89: wow: set security engine options for 802.11ax chips only
  wifi: rtw89: update suspend/resume for different generation
  wifi: rtw89: wow: update config mac function with different generation
  wifi: rtw89: update DMA function with different generation
  wifi: rtw89: wow: update WoWLAN status register for different generation
  wifi: rtw89: wow: update WoWLAN reason register for different chips
  wifi: brcm80211: handle pmk_op allocation failure
  wifi: rtw89: coex: Add coexistence policy to decrease WiFi packet CRC-ERR
  wifi: rtw89: coex: When Bluetooth not available don't set power/gain
  wifi: rtw89: coex: add return value to ensure H2C command is success or not
  wifi: rtw89: coex: Reorder H2C command index to align with firmware
  wifi: rtw89: coex: add BTC ctrl_info version 7 and related logic
  wifi: rtw89: coex: add init_info H2C command format version 7
  wifi: rtw89: 8922a: add coexistence helpers of SW grant
  wifi: rtw89: mac: add coexistence helpers {cfg/get}_plt
  wifi: cw1200: restore endian swapping
  wifi: wlcore: sdio: Rate limit wl12xx_sdio_raw_{read,write}() failures warns
  wifi: rtlwifi: Remove rtl_intf_ops.read_efuse_byte
  wifi: rtw88: 8821c: Fix false alarm count
  ...
====================

Link: https://lore.kernel.org/r/20240308100429.B8EA2C433F1@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-08 09:05:49 -08:00
Luiz Augusto von Dentz
3d1c16e920 Bluetooth: hci_sync: Fix UAF in hci_acl_create_conn_sync
This fixes the following error caused by hci_conn being freed while
hcy_acl_create_conn_sync is pending:

==================================================================
BUG: KASAN: slab-use-after-free in hci_acl_create_conn_sync+0xa7/0x2e0
Write of size 2 at addr ffff888002ae0036 by task kworker/u3:0/848

CPU: 0 PID: 848 Comm: kworker/u3:0 Not tainted 6.8.0-rc6-g2ab3e8d67fc1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38
04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x21/0x70
 print_report+0xce/0x620
 ? preempt_count_sub+0x13/0xc0
 ? __virt_addr_valid+0x15f/0x310
 ? hci_acl_create_conn_sync+0xa7/0x2e0
 kasan_report+0xdf/0x110
 ? hci_acl_create_conn_sync+0xa7/0x2e0
 hci_acl_create_conn_sync+0xa7/0x2e0
 ? __pfx_hci_acl_create_conn_sync+0x10/0x10
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_hci_acl_create_conn_sync+0x10/0x10
 hci_cmd_sync_work+0x138/0x1c0
 process_one_work+0x405/0x800
 ? __pfx_lock_acquire+0x10/0x10
 ? __pfx_process_one_work+0x10/0x10
 worker_thread+0x37b/0x670
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x19b/0x1e0
 ? kthread+0xfe/0x1e0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2f/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Allocated by task 847:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 __kasan_kmalloc+0x8f/0xa0
 hci_conn_add+0xc6/0x970
 hci_connect_acl+0x309/0x410
 pair_device+0x4fb/0x710
 hci_sock_sendmsg+0x933/0xef0
 sock_write_iter+0x2c3/0x2d0
 do_iter_readv_writev+0x21a/0x2e0
 vfs_writev+0x21c/0x7b0
 do_writev+0x14a/0x180
 do_syscall_64+0x77/0x150
 entry_SYSCALL_64_after_hwframe+0x6c/0x74

Freed by task 847:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 __kasan_slab_free+0xfa/0x150
 kfree+0xcb/0x250
 device_release+0x58/0xf0
 kobject_put+0xbb/0x160
 hci_conn_del+0x281/0x570
 hci_conn_hash_flush+0xfc/0x130
 hci_dev_close_sync+0x336/0x960
 hci_dev_close+0x10e/0x140
 hci_sock_ioctl+0x14a/0x5c0
 sock_ioctl+0x58a/0x5d0
 __x64_sys_ioctl+0x480/0xf60
 do_syscall_64+0x77/0x150
 entry_SYSCALL_64_after_hwframe+0x6c/0x74

Fixes: 45340097ce ("Bluetooth: hci_conn: Only do ACL connections sequentially")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-03-08 11:06:14 -05:00
Frédéric Danis
2ab3e8d67f Bluetooth: Fix eir name length
According to Section 1.2 of Core Specification Supplement Part A the
complete or short name strings are defined as utf8s, which should not
include the trailing NULL for variable length array as defined in Core
Specification Vol1 Part E Section 2.9.3.

Removing the trailing NULL allows PTS to retrieve the random address based
on device name, e.g. for SM/PER/KDU/BV-02-C, SM/PER/KDU/BV-08-C or
GAP/BROB/BCST/BV-03-C.

Fixes: f61851f64b ("Bluetooth: Fix append max 11 bytes of name to scan rsp data")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-03-08 10:22:17 -05:00
Eric Dumazet
155549a668 ipv6: remove RTNL protection from inet6_dump_addr()
We can now remove RTNL acquisition while running
inet6_dump_addr(), inet6_dump_ifmcaddr()
and inet6_dump_ifacaddr().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 11:15:36 +00:00
Eric Dumazet
9cc4cc329d ipv6: use xa_array iterator to implement inet6_dump_addr()
inet6_dump_addr() can use the new xa_array iterator
for better scalability.

Make it ready for RCU-only protection.
RTNL use is removed in the following patch.

Also properly return 0 at the end of a dump to avoid
and extra recvmsg() to get NLMSG_DONE.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 11:15:36 +00:00
Eric Dumazet
46f5182dd7 ipv6: make in6_dump_addrs() lockless
in6_dump_addrs() is called with RCU protection.

There is no need holding idev->lock to iterate through unicast addresses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 11:15:35 +00:00
Eric Dumazet
f0a7da7020 ipv6: make inet6_fill_ifaddr() lockless
Make inet6_fill_ifaddr() lockless, and add approriate annotations
on ifa->tstamp, ifa->valid_lft, ifa->preferred_lft, ifa->ifa_proto
and ifa->rt_priority.

Also constify 2nd argument of inet6_fill_ifaddr(), inet6_fill_ifmcaddr()
and inet6_fill_ifacaddr().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 11:15:35 +00:00
David S. Miller
3dbf6d67f2 ipsec-next-2024-03-06
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmXoQdQACgkQrB3Eaf9P
 W7dnTQ//RnTEaOPgTsHzhSwVOfWhsWkHx2xqUAlPNY8W2jrzxGgAIknPzobivvRJ
 U2bYPXDocDHUJAHIELUlu+lzATEz8baBN5zK5a+pPx5hXJlf5UI95linNZ5rEIiV
 RoxLpicnJqtWn1oMZ8d7Y0CknsLR/f4ruiVApzoifk1JaXC/zX8FcqqKsSPwVlqA
 GKy4+f71rNrIE9fbBAqDpmt6RuyRp/5yXPHLBoZlEXfYrYU1JOG8b/HLtGMD0SzV
 yHbDcgRPtbkWgAwNO/zxSDKa+PZr7NbVgakDzyHK+TltpU+6cOsajCaSXHWwsTBB
 +AebDschYY1H49oQe4bwLbNdGY+4lFvXxtk02sa8eM5a104MWxxTEB1QGAEri6gQ
 biAh3xTTbDpls26qkm97iZ6LlDE6pVIzF744buOYedvR8gjjoLt1z1PId05wMYGB
 A/4P6WkM8I1CZL++ODVfT8qR2N6lwFAQ6AM/eqHLvc6QpZ5Hm3lQAdLz1tK6QlCP
 MIV9uuNz8dFPrX1QifmLGojjdedB+4ASglxffOaoqRpHnMgHgzWTOux8tSFpuJGu
 mIYO/Dv5sHMdH8Jm+xXX1549bRzR+KGuqjXPxOSiO1jbOb5VC5ZDd3LVWb7fpDid
 K4eaU4Bo4R3eiCo1Bapt/1jKV1YFuyBKqTvObCDslVuN3Fu9d7I=
 =e4aa
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2024-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
1) Introduce forwarding of ICMP Error messages. That is specified
   in RFC 4301 but was never implemented. From Antony Antony.

2) Use KMEM_CACHE instead of kmem_cache_create in xfrm6_tunnel_init()
   and xfrm_policy_init(). From Kunwu Chan.

3) Do not allocate stats in the xfrm interface driver, this can be done
   on net core now. From Breno Leitao.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:56:05 +00:00
Ido Schimmel
5072ae00ae net: nexthop: Expose nexthop group HW stats to user space
Add netlink support for reading NH group hardware stats.

Stats collection is done through a new notifier,
NEXTHOP_EVENT_HW_STATS_REPORT_DELTA. Drivers that implement HW counters for
a given NH group are thereby asked to collect the stats and report back to
core by calling nh_grp_hw_stats_report_delta(). This is similar to what
netdevice L3 stats do.

Besides exposing number of packets that passed in the HW datapath, also
include information on whether any driver actually realizes the counters.
The core can tell based on whether it got any _report_delta() reports from
the drivers. This allows enabling the statistics at the group at any time,
with drivers opting into supporting them. This is also in line with what
netdevice L3 stats are doing.

So as not to waste time and space, tie the collection and reporting of HW
stats with a new op flag, NHA_OP_FLAG_DUMP_HW_STATS.

Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Kees Cook <keescook@chromium.org> # For the __counted_by bits
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:35:47 +00:00
Ido Schimmel
746c19a52e net: nexthop: Add ability to enable / disable hardware statistics
Add netlink support for enabling collection of HW statistics on nexthop
groups.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:35:47 +00:00
Ido Schimmel
5877786fcf net: nexthop: Add hardware statistics notifications
Add hw_stats field to several notifier structures to communicate to the
drivers that HW statistics should be configured for nexthops within a given
group.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:35:47 +00:00
Ido Schimmel
95fedd7685 net: nexthop: Expose nexthop group stats to user space
Add netlink support for reading NH group stats.

This data is only for statistics of the traffic in the SW datapath. HW
nexthop group statistics will be added in the following patches.

Emission of the stats is keyed to a new op_stats flag to avoid cluttering
the netlink message with stats if the user doesn't need them:
NHA_OP_FLAG_DUMP_STATS.

Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:35:47 +00:00
Ido Schimmel
f4676ea74b net: nexthop: Add nexthop group entry stats
Add nexthop group entry stats to count the number of packets forwarded
via each nexthop in the group. The stats will be exposed to user space
for better data path observability in the next patch.

The per-CPU stats pointer is placed at the beginning of 'struct
nh_grp_entry', so that all the fields accessed for the data path reside
on the same cache line:

struct nh_grp_entry {
        struct nexthop *           nh;                   /*     0     8 */
        struct nh_grp_entry_stats * stats;               /*     8     8 */
        u8                         weight;               /*    16     1 */

        /* XXX 7 bytes hole, try to pack */

        union {
                struct {
                        atomic_t   upper_bound;          /*    24     4 */
                } hthr;                                  /*    24     4 */
                struct {
                        struct list_head uw_nh_entry;    /*    24    16 */
                        u16        count_buckets;        /*    40     2 */
                        u16        wants_buckets;        /*    42     2 */
                } res;                                   /*    24    24 */
        };                                               /*    24    24 */
        struct list_head           nh_list;              /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct nexthop *           nh_parent;            /*    64     8 */

        /* size: 72, cachelines: 2, members: 6 */
        /* sum members: 65, holes: 1, sum holes: 7 */
        /* last cacheline: 8 bytes */
};

Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:35:46 +00:00
Petr Machata
a207eab103 net: nexthop: Add NHA_OP_FLAGS
In order to add per-nexthop statistics, but still not increase netlink
message size for consumers that do not care about them, there needs to be a
toggle through which the user indicates their desire to get the statistics.
To that end, add a new attribute, NHA_OP_FLAGS. The idea is to be able to
use the attribute for carrying of arbitrary operation-specific flags, i.e.
not make it specific for get / dump.

Add the new attribute to get and dump policies, but do not actually allow
any flags yet -- those will come later as the flags themselves are defined.
Add the necessary parsing code.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:35:46 +00:00
Petr Machata
2118f9390d net: nexthop: Adjust netlink policy parsing for a new attribute
A following patch will introduce a new attribute, op-specific flags to
adjust the behavior of an operation. Different operations will recognize
different flags.

- To make the differentiation possible, stop sharing the policies for get
  and del operations.

- To allow querying for presence of the attribute, have all the attribute
  arrays sized to NHA_MAX, regardless of what is permitted by policy, and
  pass the corresponding value to nlmsg_parse() as well.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:35:46 +00:00
Jakub Kicinski
6025b9135f net: dqs: add NIC stall detector based on BQL
softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?

Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.

We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.

This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).

The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.

To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.

Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.

We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:23:26 +00:00
Jakub Kicinski
92f8b1f5ca netdev: add queue stat for alloc failures
Rx alloc failures are commonly counted by drivers.
Support reporting those via netdev-genl queue stats.

Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306195509.1502746-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:13:26 -08:00
Jakub Kicinski
ab63a2387c netdev: add per-queue statistics
The ethtool-nl family does a good job exposing various protocol
related and IEEE/IETF statistics which used to get dumped under
ethtool -S, with creative names. Queue stats don't have a netlink
API, yet, and remain a lion's share of ethtool -S output for new
drivers. Not only is that bad because the names differ driver to
driver but it's also bug-prone. Intuitively drivers try to report
only the stats for active queues, but querying ethtool stats
involves multiple system calls, and the number of stats is
read separately from the stats themselves. Worse still when user
space asks for values of the stats, it doesn't inform the kernel
how big the buffer is. If number of stats increases in the meantime
kernel will overflow user buffer.

Add a netlink API for dumping queue stats. Queue information is
exposed via the netdev-genl family, so add the stats there.
Support per-queue and sum-for-device dumps. Latter will be useful
when subsequent patches add more interesting common stats than
just bytes and packets.

The API does not currently distinguish between HW and SW stats.
The expectation is that the source of the stats will either not
matter much (good packets) or be obvious (skb alloc errors).

Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306195509.1502746-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:13:25 -08:00
Eric Dumazet
ce7f49ab74 net: move rps_sock_flow_table to net_hotdata
rps_sock_flow_table and rps_cpu_mask are used in fast path.

Move them to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-19-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
490a79faf9 net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
df51b84564 ipv6: move tcp_ipv6_hash_secret and udp_ipv6_hash_secret to net_hotdata
Use a 32bit hole in "struct net_offload" to store
the remaining 32bit secrets used by TCPv6 and UDPv6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-17-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
5af674bb90 ipv6: move inet6_ehash_secret and udp6_ehash_secret into net_hotdata
"struct inet6_protocol" has a 32bit hole in 32bit arches.

Use it to store the 32bit secret used by UDP and TCP,
to increase cache locality in rx path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-16-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
6e0735723a inet: move inet_ehash_secret and udp_ehash_secret into net_hotdata
"struct net_protocol" has a 32bit hole in 32bit arches.

Use it to store the 32bit secret used by UDP and TCP,
to increase cache locality in rx path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-15-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
571bf020be inet: move tcp_protocol and udp_protocol to net_hotdata
These structures are read in rx path, move them to net_hotdata
for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-14-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
4ea0875b9d ipv6: move tcpv6_protocol and udpv6_protocol to net_hotdata
These structures are read in rx path, move them to net_hotdata
for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-13-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
6a55ca6b01 udp: move udpv4_offload and udpv6_offload to net_hotdata
These structures are used in GRO and GSO paths.
Move them to net_hodata for better cache locality.

v2: udpv6_offload definition depends on CONFIG_INET=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Eric Dumazet
aa70d2d16f net: move skbuff_cache(s) to net_hotdata
skbuff_cache, skbuff_fclone_cache and skb_small_head_cache
are used in rx/tx fast paths.

Move them to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:42 -08:00
Eric Dumazet
71c0de9bac net: move dev_rx_weight to net_hotdata
dev_rx_weight is read from process_backlog().

Move it to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:42 -08:00
Eric Dumazet
26722dc74b net: move dev_tx_weight to net_hotdata
dev_tx_weight is used in tx fast path.

Move it to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:42 -08:00
Eric Dumazet
0139806eeb net: move tcpv4_offload and tcpv6_offload to net_hotdata
These are used in TCP fast paths.

Move them into net_hotdata for better cache locality.

v2: tcpv6_offload definition depends on CONFIG_INET

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:42 -08:00
Eric Dumazet
61a0be1a53 net: move ip_packet_offload and ipv6_packet_offload to net_hotdata
These structures are used in GRO and GSO paths.

v2: ipv6_packet_offload definition depends on CONFIG_INET

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:42 -08:00
Eric Dumazet
edbc666cdc net: move netdev_max_backlog to net_hotdata
netdev_max_backlog is used in rx fat path.

Move it to net_hodata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:42 -08:00
Eric Dumazet
0b91fa4bfb net: move ptype_all into net_hotdata
ptype_all is used in rx/tx fast paths.

Move it to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:41 -08:00
Eric Dumazet
f59b5416c3 net: move netdev_tstamp_prequeue into net_hotdata
netdev_tstamp_prequeue is used in rx path.

Move it to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:41 -08:00
Eric Dumazet
ae6e22f7b7 net: move netdev_budget and netdev_budget to net_hotdata
netdev_budget and netdev_budget are used in rx path (net_rx_action())

Move them into net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:41 -08:00
Eric Dumazet
2658b5a8a4 net: introduce struct net_hotdata
Instead of spreading networking critical fields
all over the places, add a custom net_hotdata
structure so that we can precisely control its layout.

In this first patch, move :

- gro_normal_batch used in rx (GRO stack)
- offload_base used in rx and tx (GRO and TSO stacks)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:41 -08:00
Jakub Kicinski
c3874bbec9 rxrpc changes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmXnrwoACgkQ+7dXa6fL
 C2vtoA//UstImRxievynpzfaFy14vCY+qVgW3n88sDT9ITP/jCQH/hMorIft7B2K
 W7xN15OrWJ7SmfSSiWgZRTPgc0y5J/25woinZk9zSrFyOPfYXEJUy/Rw2VpQQ1eU
 lN1ArHvSzj/wWYZjb5FfzNds2mO9dpSvgP8gsBPaN/49OX4ZDpfwQO0ho20c8qE3
 VEUAoEPSYFd9EUJ4NBvXhn5q0Z1dp/tl3sQZdOFIN3EZnQBloPFFkvBbDTuX2jtK
 Q7F6cgPc/BWfQ+U4OwyPOA3QBrTPMmWCUtM1QCvYV+cj1O7xGaof372Fqin0E7z9
 NVIN4fVk3SIW/LRLhpyHoVOc0jHIona9gAsawfRzSIz7sPdeoXy0oOFHyKWrf3dH
 TdrdnPPkUaGLPbTv2/l6lzz+uCkMlQ6BOZV+y6gm1+OcM7BEvYBeA6RShPKEk0vo
 XsCwn6Pl9y8TZTOpcjal0b+M6mR5Nz0hYzcdCjmBc3lY8ueURPdvaNyLgmAtMu5c
 udsq8yMxaXyc3BblrtTTdV+sqdycTfU6/CoNY44uS4X9H+MsFYBgoba+HSvekUWj
 b/DCE2YfjQXdRmdlfGhyHSjnvOZohd8wq7i9TEOc8PSNXHo0CV/U8Q6pgsRrwSpG
 7DecipkrdFPYqzOl8PRIwWg/VwNGEI5ESzDD9Svs+cODSYJXJjE=
 =Egry
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-iothread-20240305' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
Here are some changes to AF_RXRPC:

 (1) Cache the transmission serial number of ACK and DATA packets in the
     rxrpc_txbuf struct and log this in the retransmit tracepoint.

 (2) Don't use atomics on rxrpc_txbuf::flags[*] and cache the intended wire
     header flags there too to avoid duplication.

 (3) Cache the wire checksum in rxrpc_txbuf to make it easier to create
     jumbo packets in future (which will require altering the wire header
     to a jumbo header and restoring it back again for retransmission).

 (4) Fix the protocol names in the wire ACK trailer struct.

 (5) Strip all the barriers and atomics out of the call timer tracking[*].

 (6) Remove atomic handling from call->tx_transmitted and
     call->acks_prev_seq[*].

 (7) Don't bother resetting the DF flag after UDP packet transmission.  To
     change it, we now call directly into UDP code, so it's quick just to
     set it every time.

 (8) Merge together the DF/non-DF branches of the DATA transmission to
     reduce duplication in the code.

 (9) Add a kvec array into rxrpc_txbuf and start moving things over to it.
     This paves the way for using page frags.

(10) Split (sub)packet preparation and timestamping out of the DATA
     transmission function.  This helps pave the way for future jumbo
     packet generation.

(11) In rxkad, don't pick values out of the wire header stored in
     rxrpc_txbuf, buf rather find them elsewhere so we can remove the wire
     header from there.

(12) Move rxrpc_send_ACK() to output.c so that it can be merged with
     rxrpc_send_ack_packet().

(13) Use rxrpc_txbuf::kvec[0] to access the wire header for the packet
     rather than directly accessing the copy in rxrpc_txbuf.  This will
     allow that to be removed to a page frag.

(14) Switch from keeping the transmission buffers in rxrpc_txbuf allocated
     in the slab to allocating them using page fragment allocators.  There
     are separate allocators for DATA packets (which persist for a while)
     and control packets (which are discarded immediately).

     We can then turn on MSG_SPLICE_PAGES when transmitting DATA and ACK
     packets.

     We can also get rid of the RCU cleanup on rxrpc_txbufs, preferring
     instead to release the page frags as soon as possible.

(15) Parse received packets before handling timeouts as the former may
     reset the latter.

(16) Make sure we don't retransmit DATA packets after all the packets have
     been ACK'd.

(17) Differentiate traces for PING ACK transmission.

(18) Switch to keeping timeouts as ktime_t rather than a number of jiffies
     as the latter is too coarse a granularity.  Only set the call timer at
     the end of the call event function from the aggregate of all the
     timeouts, thereby reducing the number of timer calls made.  In future,
     it might be possible to reduce the number of timers from one per call
     to one per I/O thread and to use a high-precision timer.

(19) Record RTT probes after successful transmission rather than recording
     it before and then cancelling it after if unsuccessful[*].  This
     allows a number of calls to get the current time to be removed.

(20) Clean up the resend algorithm as there's now no need to walk the
     transmission buffer under lock[*].  DATA packets can be retransmitted
     as soon as they're found rather than being queued up and transmitted
     when the locked is dropped.

(21) When initially parsing a received ACK packet, extract some of the
     fields from the ack info to the skbuff private data.  This makes it
     easier to do path MTU discovery in the future when the call to which a
     PING RESPONSE ACK refers has been deallocated.

[*] Possible with the move of almost all code from softirq context to the
    I/O thread.

Link: https://lore.kernel.org/r/20240301163807.385573-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20240304084322.705539-1-dhowells@redhat.com/ # v2

* tag 'rxrpc-iothread-20240305' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (21 commits)
  rxrpc: Extract useful fields from a received ACK to skb priv data
  rxrpc: Clean up the resend algorithm
  rxrpc: Record probes after transmission and reduce number of time-gets
  rxrpc: Use ktimes for call timeout tracking and set the timer lazily
  rxrpc: Differentiate PING ACK transmission traces.
  rxrpc: Don't permit resending after all Tx packets acked
  rxrpc: Parse received packets before dealing with timeouts
  rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags
  rxrpc: Use rxrpc_txbuf::kvec[0] instead of rxrpc_txbuf::wire
  rxrpc: Move rxrpc_send_ACK() to output.c with rxrpc_send_ack_packet()
  rxrpc: Don't pick values out of the wire header when setting up security
  rxrpc: Split up the DATA packet transmission function
  rxrpc: Add a kvec[] to the rxrpc_txbuf struct
  rxrpc: Merge together DF/non-DF branches of data Tx function
  rxrpc: Do lazy DF flag resetting
  rxrpc: Remove atomic handling on some fields only used in I/O thread
  rxrpc: Strip barriers and atomics off of timer tracking
  rxrpc: Fix the names of the fields in the ACK trailer struct
  rxrpc: Note cksum in txbuf
  rxrpc: Convert rxrpc_txbuf::flags into a mask and don't use atomics
  ...
====================

Link: https://lore.kernel.org/r/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 20:59:58 -08:00